Description

BACKGROUND OF THE INVENTION 

The present invention relates to the sequencing, fingerprinting, and mapping of polymers, particularly biological
polymers. The inventions may be applied, for example, in the sequencing, fingerprinting, or mapping of polynucleotides.

The relationship between structure and function of macromolecules is of fundamental importance in the
understanding of biological systems. These relationships are important to understanding, for example, the functions of
enzymes, structural proteins, and signalling proteins, ways in which cells communicate with each other, as well as
mechanisms of cellular control and metabolic feedback.

Genetic information is critical in continuation of life processes. Life is substantially informationally based and its
genetic content controls the growth and reproduction of the organism and its complements. Polypeptides, which are
critical features of all living systems, are encoded by the genetic material of the cell. In particular, the properties of
enzymes, functional proteins, and structural proteins are determined by the sequence of amino acids which make them
up. As structure and function are integrally related, many biological functions may be explained by elucidating the
underlying structural features which provide those functions. For this reason, it has become very important to
determine the genetic sequences of nucleotides which encode the enzymes, structural proteins, and other effectors of
biological functions. In addition to segments of nucleotides which encode polypeptides, there are many nucleotide
sequences which are involved in control and regulation of gene expression.

The human genome project is directed toward determining the complete sequence of the genome of the human
organism. Although such a sequence would not correspond to the sequence of any specific individual, it would provide
significant information as to the general organization and specific sequences contained within segments from particular
individuals. It would also provide mapping information which is very useful for further detailed studies. However, the
need for highly rapid, accurate, and inexpensive sequencing technology is nowhere more apparent than in a demanding
sequencing project such as this. To complete the sequencing of a human genome would require the determination of
approximately 3 x 10^9, or 3 billion, base pairs.

The procedures typically used today for sequencing include the Sanger dideoxy method, see, e.g., Sanger et al.
(1977) Proc. Natl. Acad. Sci. USA 74:5463-5467, or the Maxam and Gilbert method, see, e.g., Maxam et al. (1980)
Methods in Enzymology 65:499-559. The Sanger method utilizes enzymatic elongation procedures with chain
terminating nucleotides. The Maxam and Gilbert method uses chemical reactions exhibiting specificity of reaction to
generate nucleotide specific cleavages. Both methods require a practitioner to perform a large number of complex manual
manipulations. These manipulations usually require isolating homogeneous DNA fragments, elaborate and tedious
preparation of samples, preparing a separating gel, applying samples to the gel, electrophoresing the samples into this gel,
working up the finished gel, and analyzing the results of the procedure.

Thus, a less expensive, highly reliable, and labor efficient means for sequencing biological macromolecules is
needed. A substantial reduction in cost and increase in speed of nucleotide sequencing would be very much welcomed.
In particular, an automated system would improve the reproducibility and accuracy of procedures. The present
invention satisfies these and other needs.

SUMMARY OF THE INVENTION

The present invention provides improved methods useful for de novo sequencing of an unknown polymer
sequence, for verification of known sequences, for fingerprinting polymers, and for mapping homologous segments
within a sequence. By reducing the number of manual manipulations required and automating most of the steps, the
speed, accuracy, and reliability of these procedures are greatly enhanced.

The production of a substrate having a matrix of positionally defined regions with attached reagents exhibiting
known recognition specificity can be used for the sequence analysis of a polymer. Although most directly applicable to
sequencing, the present invention is also applicable to fingerprinting, mapping, and general screening of specific
interactions.

The present invention also provides a means to automate sequencing manipulations. The automation of the
substrate production method and of the scan and analysis steps minimizes the need for human intervention. This simplifies
the tasks and promotes reproducibility.

The present invention provides a composition comprising a plurality of positionally distinguishable sequence
specific reagents attached to a solid substrate, which reagents are capable of specifically binding to a predetermined
subunit sequence of a preselected multi-subunit length having at least three subunits, said reagents representing
substantially all possible sequences of said preselected length. In some embodiments, the subunit sequence is a
polynucleotide sequence. In other embodiments, the specific reagent is an oligonucleotide of at least about five nucleotides,
preferably at least eight nucleotides, more preferably at least 12 nucleotides. Usually the specific reagents are all
attached to a single solid substrate, and the reagents comprise at least 3000 different sequences. In other
embodiments, the reagents represent at least about 25% of the possible subsequences of said preselected length. Usually,
the reagents are localized in regions of the substrate having a density of at least 25 regions per square centimeter, and
often the substrate has a surface area of less than about 4 square centimeters.
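The numbers in the preceding paragraph can be related by simple counting. The following is a minimal sketch, assuming a four-letter nucleotide alphabet; the function names are hypothetical and are used only for illustration.

```python
def num_possible_sequences(length: int, alphabet_size: int = 4) -> int:
    """Count the distinct subunit sequences of a preselected length."""
    return alphabet_size ** length


def regions_available(density_per_cm2: float, area_cm2: float) -> int:
    """Count the positionally distinct regions on a substrate of given area."""
    return int(density_per_cm2 * area_cm2)


# All possible probes for a few preselected lengths (4 nucleotides assumed).
for n in (5, 6, 8, 12):
    print(f"{n}-mers: {num_possible_sequences(n)} possible sequences")

# A 25% subset of all 8-mers, as in the "at least about 25%" embodiment.
print("25% of 8-mers:", num_possible_sequences(8) // 4)

# Regions at the minimum stated density on a 4 square centimeter substrate.
print("regions:", regions_available(25, 4))
```

Under these assumptions a complete set of 6-mers (4096 sequences) is the shortest complete set that exceeds the 3000-sequence figure, and much higher region densities than 25 per square centimeter are needed to place a complete set of longer probes on a small substrate.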

The present invention also provides methods for analyzing a sequence of a polynucleotide, said method
comprising the step of:

a) exposing said polynucleotide to a composition as described. 

It also provides useful methods for identifying or comparing a target sequence with a reference (i.e., fingerprinting),
said method comprising the steps of:

a) exposing said target sequence to a composition as described; 

b) determining the pattern of positions of the reagents which specifically interact with the target sequence; and 

c) comparing the pattern with the pattern exhibited by the reference when exposed to the composition. 
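Steps a) through c) can be sketched in code. This is an illustrative model only: it assumes idealized hybridization in which a probe interacts exactly when its reverse complement occurs in the target, and it uses a Jaccard index as one possible way to compare patterns. The function names and the example sequences are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def fingerprint(target: str, probes: list[str]) -> set[int]:
    """Positions (indices) of probes that would bind the target.

    Assumption: perfect-match hybridization only, no mismatches.
    """
    return {i for i, probe in enumerate(probes)
            if reverse_complement(probe) in target}


def pattern_similarity(a: set[int], b: set[int]) -> float:
    """Jaccard index of two position patterns (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


probes = ["ACG", "CGT", "TTT", "GGA"]      # hypothetical probe layout
reference = fingerprint("AAACGTCC", probes)
sample = fingerprint("AAACGTGG", probes)
print(sorted(reference), sorted(sample), pattern_similarity(reference, sample))
```

A real comparison would have to tolerate noise and partial hybridization; the set comparison here only illustrates that the pattern of positions, not the underlying sequence, is what is compared.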

By way of example and not limitation, such fingerprinting methods may be used for personal identification, genetic
screening, identification of pathological conditions, determination of patterns of specific gene expression, and others.

The present invention also provides methods for sequencing a segment of a polynucleotide comprising the steps of:

a) combining: 

i) a substrate comprising a plurality of chemically synthesized and positionally distinguishable oligonucleotides
capable of recognizing defined oligonucleotide sequences; and

ii) a target polynucleotide; thereby forming high fidelity matched duplex structures of complementary
subsequences of known sequence; and

b) determining which of said reagents have specifically interacted with subsequences in said target polynucleotide. 

In one embodiment the segment is substantially the entire length of said polynucleotide. 

In one embodiment, the substrates are beads. Preferably, the plurality of reagents comprise substantially all
possible subsequences of said preselected length found in said target. In another embodiment, the solid phase substrate is
a single substrate having attached thereto reagents recognizing substantially all possible subsequences of preselected
length found in said target.

In another embodiment, the method further comprises the step of analyzing a plurality of said recognized
subsequences to assemble a sequence of said target polymer. In a bead embodiment, at least some of the plurality of
substrates have one subsequence specific reagent attached thereto, and the substrates are coded to indicate the
sequence specificity of said reagent.

The present invention also provides methods for sequencing a polymer, said methods comprising the steps of: 

a) preparing a plurality of reagents which each specifically bind to a subsequence of preselected length;

b) positionally attaching each of said reagents to one or more solid phase substrates, thereby producing sequence
specific probes;

c) combining said substrates with a target polymer whose sequence is to be determined; and 

d) determining which of said reagents have specifically interacted with subsequences in said target polymer. 

In one embodiment, the substrates are beads. Preferably, the plurality of reagents comprise substantially all
possible subsequences of said preselected length found in said target. In another embodiment, the solid phase substrate is
a single substrate having attached thereto reagents recognizing substantially all possible subsequences of preselected
length found in said target.

In another embodiment, the method further comprises the step of analyzing a plurality of said recognized
subsequences to assemble a sequence of said target polymer. In a bead embodiment, at least some of the plurality of
substrates have one subsequence specific reagent attached thereto, and the substrates are coded to indicate the
sequence specificity of said reagent.

The present invention also embraces a method of using a fluorescent nucleotide to detect interactions with
oligonucleotide probes of known sequence, said method comprising:

a) attaching said nucleotide to a target unknown polynucleotide sequence, and

b) exposing said target polynucleotide sequence to a collection of positionally defined oligonucleotide probes of
known sequences to determine the sequences of said probes which interact with said target.

In a further refinement, an additional step is included of: 

a) collating said known sequences to determine the overlaps of said known sequences and thereby determine the
sequence of said target.

A method of mapping a plurality of sequences relative to one another is also provided, the method comprising:

a) preparing a substrate having a plurality of positionally attached sequence specific probes;

b) exposing each of said sequences to said substrate, thereby determining the patterns of interaction between said 
sequence specific probes and said sequences; and 

c) determining the relative locations of said sequence specific probe interactions on said sequences to determine 
the overlaps and order of said sequences. 
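Steps b) and c) of the mapping method can be sketched as follows. This is a simplified illustration, not the method of the disclosure: it assumes each sequence (here called a clone) is already reduced to the set of probe positions it interacts with, that sequences sharing at least one probe overlap, and that the sequences form a single unbranched chain. All names are hypothetical.

```python
from itertools import combinations


def shared_probes(patterns: dict[str, set[int]]) -> dict[tuple[str, str], int]:
    """Count probes shared by each pair of sequences that share at least one."""
    return {(a, b): len(patterns[a] & patterns[b])
            for a, b in combinations(sorted(patterns), 2)
            if patterns[a] & patterns[b]}


def order_clones(patterns: dict[str, set[int]]) -> list[str]:
    """Order sequences along a linear map by walking their overlaps.

    Assumption: overlaps form one simple chain (no branches, no cycles).
    """
    neighbors = {name: set() for name in patterns}
    for a, b in shared_probes(patterns):
        neighbors[a].add(b)
        neighbors[b].add(a)
    ends = sorted(n for n, nb in neighbors.items() if len(nb) <= 1)
    if not ends:
        raise ValueError("no chain end found; overlaps may be circular")
    order = [ends[0]]
    while True:
        unvisited = [n for n in sorted(neighbors[order[-1]]) if n not in order]
        if not unvisited:
            break
        order.append(unvisited[0])
    return order


patterns = {"clone1": {1, 2, 3}, "clone2": {3, 4, 5}, "clone3": {5, 6}}
print(shared_probes(patterns))
print(order_clones(patterns))
```

Real mapping data contain repeats, false positives, and branching overlaps, so a practical implementation would need a more robust ordering procedure than this chain walk.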

In one refinement, the sequence specific probes are oligonucleotides, applicable where the target sequences
are nucleic acid sequences.

In the nucleic acid sequencing application, the steps of the sequencing process comprise: 

a) producing a matrix substrate having known positionally defined regions of known sequence specific
oligonucleotide probes;

b) hybridizing a target polynucleotide to the positions on the matrix so that each of the positions which contain
oligonucleotide probes complementary to a sequence on the target hybridize to the target molecule;

c) detecting which positions have bound the target thereby determining sequences which are found on the target; 
and 

d) analyzing the known sequences contained in the target to determine sequence overlaps and assembling the 
sequence of the target therefrom. 

The enablement of the sequencing process by hybridization is based in large part upon the ability to synthesize a
large number (e.g., to virtually saturate) of the possible overlapping sequence segments, to distinguish those
probes which hybridize with fidelity from those which have mismatched bases, and to analyze a highly complex pattern
of hybridization results to determine the overlap regions.

The detecting of the positions which bind the target sequence would typically be through a fluorescent label on the
target. Although a fluorescent label is probably most convenient, other sorts of labels, e.g., radioactive, enzyme linked,
optically detectable, or spectroscopic labels, may be used. Because the oligonucleotide probes are positionally defined,
the location of the hybridized duplex will directly translate to the sequences which hybridize. Thus, analysis of the
positions provides a collection of subsequences found within the target sequence. These subsequences are matched
with respect to their overlaps so as to assemble an intact target sequence.
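The overlap matching described above can be illustrated with a small sketch. It assumes an idealized case: the detected positions have already been translated into target subsequences of equal length k (k of at least 2), every subsequence occurs once, and every overlap of k-1 subunits is unambiguous. The function names are hypothetical.

```python
def spectrum(target: str, k: int) -> set[str]:
    """All length-k subsequences of a target (the idealized detection result)."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def assemble(kmers: set[str]) -> str:
    """Rebuild a sequence by chaining subsequences that overlap by k-1 subunits.

    Raises ValueError when the overlaps do not define a unique linear order.
    """
    kmers = set(kmers)
    k = len(next(iter(kmers)))
    suffixes = {s[1:] for s in kmers}
    # The first subsequence is the one whose prefix no other subsequence ends with.
    starts = [s for s in kmers if s[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting subsequence")
    sequence = starts[0]
    remaining = kmers - {sequence}
    while remaining:
        tail = sequence[-(k - 1):]
        candidates = [s for s in remaining if s[:-1] == tail]
        if len(candidates) != 1:
            raise ValueError("ambiguous or missing overlap")
        sequence += candidates[0][-1]
        remaining.remove(candidates[0])
    return sequence


print(assemble(spectrum("ATGCGTAC", 3)))
```

Repeated subsequences make the overlaps ambiguous, which is why this sketch raises an error rather than guessing; such complications call for longer probes or additional information.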

BRIEF DESCRIPTION OF THE FIGURES 

Fig. 1 illustrates a flow chart for sequence, fingerprint or mapping analysis. 
Fig. 2 illustrates the proper function of a VLSIPS nucleotide synthesis. 
Fig. 3 illustrates the proper function of a VLSIPS dinucleotide synthesis.
Fig. 4 illustrates the process of a VLSIPS trinucleotide synthesis. 

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

I. Overall Description

A. general 

B. VLSIPS substrates 

C. binary masking 



D. applications 

E. detection methods and apparatus 

F. data analysis 

II. Theoretical Analysis 

A. simple n-mer structure; theory 

B. complications 

III. Polynucleotide Sequencing 

A. preparation of substrate matrix 

B. labeling target polynucleotide 

C. hybridization conditions

D. detection; VLSIPS scanning 

E. analysis 

F. substrate reuse 

IV. Fingerprinting

A. general 

B. preparation of substrate matrix 

C. labeling target nucleotides 

D. hybridization conditions 

E. detection; VLSIPS scanning 

F. analysis

G. substrate reuse 

H. other polynucleotide aspects 

V. Mapping

A. general 

B. preparation of substrate matrix 

C. labeling

D. hybridization/specific interaction 

E. detection

F. analysis

G. substrate reuse

VI. Additional Screening 

A. specific interactions 

B. sequence comparisons 

C. categorizations 

D. statistical correlations 

VII. Formation of Substrate 

A. instrumentation 

B. binary masking 

C. synthetic methods 

D. surface immobilization 

VIII. Hybridization/Specific Interaction 

A. general 

B. important parameters 
IX. Detection Methods

A. labeling techniques 

B. scanning system 

X. Data Analysis 

A. general 

B. hardware

C. software

XI. Substrate Reuse

A. removal of label 

B. storage and preservation

C. processes to avoid degradation of oligomers 

XII. Integrated Sequencing Strategy 

A. initial mapping strategy

B. selection of smaller clones 

XIII. Commercial Applications 

A. sequencing

B. fingerprinting

C. mapping 

I. OVERALL DESCRIPTION 

A. General 

The present invention relies in part on the ability to synthesize or attach specific recognition reagents at known
locations on a substrate, typically a single substrate. In particular, the present invention provides the ability to prepare
a substrate having a very high density matrix pattern of positionally defined specific recognition reagents. The reagents
are capable of interacting with their specific targets while attached to the substrate, e.g., solid phase interactions, and
by appropriate labeling of these targets, the sites of the interactions between the target and the specific reagents may
be derived. Because the reagents are positionally defined, the sites of the interactions will define the specificity of each
interaction. As a result, a map of the patterns of interactions with specific reagents on the substrate is convertible into
information on the specific interactions taking place, e.g., the recognized features. Where the specific reagents
recognize a large number of possible features, this system allows the determination of the combination of specific interactions
which exist on the target molecule. Where the number of features is sufficiently large, the identical combination,
or pattern, of features is sufficiently unlikely that a particular target molecule may often be uniquely defined by its
features. In the extreme, the features may actually be the subunit sequence of the target molecule, and a given target
sequence may be uniquely defined by its combination of features.

In particular, the methodology is applicable to sequencing polynucleotides. The specific sequence recognition
reagents will typically be oligonucleotide probes which hybridize with specificity to subsequences found on the target
sequence. A sufficiently large number of those probes allows the fingerprinting of a target polynucleotide or the relative
mapping of a collection of target polynucleotides, as described in greater detail below.

In the high resolution fingerprinting provided by a saturating collection of probes which include all possible
subsequences of a given size, e.g., 10-mers, all the subsequences are collated, the specific overlaps are determined,
and the entire sequence can usually be reconstructed.

Sequence analysis may take the form of complete sequence determination, to the level of the sequence of
individual subunits along the entire length of the target sequence. Sequence analysis also may take the form of sequence
homology, e.g., less than absolute subunit resolution, where "similarity" in the sequence will be detectable, or the form
of selective sequences of homology interspersed at specific or irregular locations.

In either case, the sequence is determinable at selective resolution or at particular locations. Thus, the
hybridization method will be useful as a means for identification, e.g., a "fingerprint", much like a Southern hybridization method
is used. It is also useful to map particular target sequences.

B. VLSIPS Substrates 

The invention is enabled by the development of technology to prepare substrates on which specific reagents may
be either positionally attached or synthesized. In particular, the very large scale immobilized polymer synthesis
(VLSIPS) technology allows for the very high density production of an enormous diversity of reagents mapped out in a
known matrix pattern on a substrate. These reagents specifically recognize subsequences in a target polymer and bind
thereto, producing a map of positionally defined regions of interaction. These map positions are convertible into the actual
features recognized, which would thus be present in the target molecule of interest.

As indicated, the sequence specific recognition reagents will often be oligonucleotides which hybridize with fidelity
and discrimination to the target sequence.

In the generic sense, the VLSIPS technology allows the production of a substrate with a high density matrix of
positionally mapped regions with specific recognition reagents attached at each distinct region. By use of protective groups
which can be positionally removed, or added, the regions can be activated or deactivated for addition of particular
reagents or compounds. Details of the protection are described below and in PCT publication no. WO90/15070, published
December 13, 1990. In a preferred embodiment, photosensitive protecting agents will be used and the regions of
activation or deactivation may be controlled by electro-optical and optical methods, similar to many of the processes used
in semiconductor wafer and chip fabrication.

In the nucleic acid nucleotide sequencing application, a VLSIPS substrate is synthesized having positionally
defined oligonucleotide probes. See PCT publication no. WO90/15070, published December 13, 1990; and U.S.S.N.
07/624,120, filed December 6, 1990. By use of masking technology and photosensitive synthetic subunits, the VLSIPS
apparatus allows for the stepwise synthesis of polymers according to a positionally defined matrix pattern. Each
oligonucleotide probe will be synthesized at known and defined positional locations on the substrate. This forms a matrix
pattern of known relationship between position and specificity of interaction. The VLSIPS technology allows the production
of a very large number of different oligonucleotide probes to be simultaneously and automatically synthesized, including
numbers in excess of about 10^2, 10^3, 10^4, 10^5, 10^6, or even more, and at densities of at least about 10^2, 10^3/cm^2,
10^4/cm^2, 10^5/cm^2 and up to 10^6/cm^2 or more. This application discloses methods for synthesizing polymers on a silicon
or other suitably derivatized substrate, methods and chemistry for synthesizing specific types of biological polymers on
those substrates, apparatus for scanning and detecting whether interaction has occurred at specific locations on the
substrate, and various other technologies related to the use of a high density very large scale immobilized polymer
substrate. In particular, sequencing, fingerprinting, and mapping applications are discussed herein in detail, though related
technologies are described in U.S.S.N. 07/624,120, filed December 6, 1990; and PCT/US91/02989, filed May 1, 1991,
each of which is hereby incorporated herein by reference.

The regions which define particular reagents will usually be generated by selective protecting groups which may be
activated or deactivated. Typically the protecting group will be bound to a monomer subunit or spatial region, and can
be spatially affected by an activator, such as electromagnetic radiation. Examples of protective groups with utility herein
include nitroveratryl oxycarbonyl (NVOC), nitrobenzyl oxycarbonyl (NBOC) or α,α-dimethyl-dimethoxybenzyl
oxycarbonyl (DDZ).

C. Binary Masking 

In fact, the means for producing a substrate useful for these techniques are explained in U.S.S.N. 07/492,462
(VLSIPS CIP), which is hereby incorporated herein by reference. However, there are various particular ways to optimize
the synthetic processes. Many of these methods are described in U.S.S.N. 07/624,120.

Briefly, the binary synthesis strategy refers to an ordered strategy for parallel synthesis of diverse polymer
sequences by sequential addition of reagents which may be represented by a reactant matrix and a switch matrix, the
product of which is a product matrix. A reactant matrix is a 1 x n matrix of the building blocks to be added. The switch
matrix is all or a subset of the binary numbers from 1 to n arranged in columns. In preferred embodiments, a binary
strategy is one in which at least two successive steps illuminate half of a region of interest on the substrate. In most
preferred embodiments, binary synthesis refers to a synthesis strategy which also factors a previous addition step. For
example, the switch matrix for a masking strategy may halve regions that were previously illuminated,
illuminating about half of the previously illuminated region and protecting the remaining half (while also protecting about
half of previously protected regions and illuminating about half of previously protected regions). It will be recognized that
binary rounds may be interspersed with non-binary rounds and that only a portion of a substrate may be subjected to
a binary scheme, but will still be considered to be a binary masking scheme within the definition herein. A binary
"masking" strategy is a binary synthesis which uses light to remove protective groups from materials for addition of other
materials such as nucleotides.
In particular, this procedure provides a simplified and highly efficient method for saturating all possible sequences
of a defined length polymer. This masking strategy is also particularly useful in producing all possible oligonucleotide
sequence probes of a given length.
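The reactant matrix and switch matrix can be modeled in a few lines. This is a sketch under stated assumptions, not the procedure of the referenced applications: the switch matrix is taken to be every binary pattern of n bits (one column per region, including the all-zero column), a 1 in row i means the region is illuminated and receives building block i at step i, and the function names are hypothetical.

```python
from itertools import product


def switch_matrix(n_steps: int) -> list[tuple[int, ...]]:
    """One column per region: every binary pattern of n_steps bits."""
    return list(product((0, 1), repeat=n_steps))


def product_matrix(reactants: list[str],
                   switches: list[tuple[int, ...]]) -> list[str]:
    """Apply each reactant, in order, to the regions whose switch bit is 1."""
    return ["".join(block for block, on in zip(reactants, column) if on)
            for column in switches]


reactants = ["A", "B", "C", "D"]           # the 1 x n reactant matrix, n = 4
switches = switch_matrix(len(reactants))
products = product_matrix(reactants, switches)

print(len(products), "regions:", products)
# Each step illuminates exactly half of the regions in this model.
print([sum(column[step] for column in switches) for step in range(len(reactants))])
```

In this model n addition steps yield 2^n distinct products, and each step exposes half of the regions, which is the halving behavior the binary strategy describes.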

D. Applications

The technology provided by the present invention has very broad applications. Although described specifically for
polynucleotide sequences, similar sequencing, fingerprinting, mapping, and screening procedures may be applied to
polypeptide, carbohydrate, or other polymers. This may be for de novo sequencing, or may be used in conjunction with
a second sequencing procedure to provide independent verification. See, e.g., (1988) Science 242:1245. For example,
a large polynucleotide sequence defined by either the Maxam and Gilbert technique or by the Sanger technique may
be verified by using the present invention.

In addition, by selection of appropriate probes, a polynucleotide sequence can be fingerprinted. Fingerprinting is a
less detailed sequence analysis which usually involves the characterization of a sequence by a combination of defined
features. Sequence fingerprinting is particularly useful because the repertoire of possible features which can be tested
is virtually infinite. Moreover, the stringency of matching is also variable depending upon the application. A Southern
Blot analysis may be characterized as a means of simple fingerprint analysis.

Fingerprinting analysis may be performed to the resolution of specific nucleotides, or may be used to determine
homologies, most commonly for large segments. In particular, an array of oligonucleotide probes of virtually any
workable size may be positionally localized on a matrix and used to probe a sequence for either absolute complementary
matching, or homology to the desired level of stringency using selected hybridization conditions.

In addition, the present invention provides means for mapping analysis of a target sequence or sequences.
Mapping will usually involve the sequential ordering of a plurality of various sequences, or may involve the localization of a
particular sequence within a plurality of sequences. This may be achieved by immobilizing particular large segments
onto the matrix and probing with a shorter sequence to determine which of the large sequences contain that smaller
sequence. Alternatively, relatively shorter probes of known or random sequence may be immobilized to the matrix and
a map of various different target sequences may be determined from overlaps. Principles of such an approach are
described in some detail by Evans et al. (1989) "Physical Mapping of Complex Genomes by Cosmid Multiplex Analysis,"
Proc. Natl. Acad. Sci. USA 86:5030-5034; Michiels et al. (1987) "Molecular Approaches to Genome Analysis: A
Strategy for the Construction of Ordered Overlap Clone Libraries," CABIOS 3:203-210; Olson et al. (1986) "Random-Clone
Strategy for Genomic Restriction Mapping in Yeast," Proc. Natl. Acad. Sci. USA 83:7826-7830; Craig et al. (1990)
"Ordering of Cosmid Clones Covering the Herpes Simplex Virus Type I (HSV-I) Genome: A Test Case for Fingerprinting
by Hybridization," Nuc. Acids Res. 18:2653-2660; and Coulson et al. (1986) "Toward a Physical Map of the Genome of
the Nematode Caenorhabditis elegans," Proc. Natl. Acad. Sci. USA 83:7821-7825; each of which is hereby
incorporated herein by reference.

Fingerprinting analysis also provides a means of identification. In addition to its value in apprehension of criminals
from whom a biological sample, e.g., blood, has been collected, fingerprinting can ensure personal identification for
other reasons. For example, it may be useful for identification of bodies in tragedies such as fire, flood, and vehicle
crashes. In other cases the identification may be useful in identification of persons suffering from amnesia, or of missing
persons. Other forensics applications include establishing the identity of a person, e.g., military identification "dog tags",
or identifying the source of particular biological samples. Fingerprinting technology is described, e.g.,
in Carrano et al. (1989) "A High-Resolution, Fluorescence-Based, Semi-automated Method for DNA Fingerprinting,"
Genomics 4:129-136, which is hereby incorporated herein by reference. See, e.g., Table I, for nucleic acid applications.

TABLE I

VLSIPS PROJECT IN NUCLEIC ACIDS 

I. Construction of Chips 

II. Applications

A. Sequencing 

1. Primary sequencing 

2. Secondary sequencing (sequence checking) 

3. Large scale mapping 

4. Fingerprinting

B. Duplex/Triplex formation 

1. Antisense 

2. Sequence specific function modulation 
(e.g. promoter inhibition) 

C. Diagnosis 

1. Genetic markers 

2. Type markers 

a. Blood donors 

b. Tissue transplants 

D. Microbiology 

1. Clinical microbiology

2. Food microbiology 

III. Instrumentation 

A. Chip machines 

B. Detection 

IV. Software Development 

A. Instrumentation software 

B. Data reduction software 

C. Sequence analysis software

The fingerprinting analysis may be used to perform various types of genetic screening. For example, a single
substrate may be generated with a plurality of screening probes, allowing for the simultaneous genetic screening for a large
number of genetic markers. Thus, prenatal or diagnostic screening can be simplified, economized, and made more
generally accessible.

In addition to the sequencing, fingerprinting, and mapping applications, the present invention also provides means
for determining specificity of interaction with particular sequences. Many of these applications were described in
U.S.S.N. 07/362,901 (VLSIPS parent), U.S.S.N. 07/492,462 (VLSIPS CIP), U.S.S.N. 07/435,316 (caged biotin parent),
and U.S.S.N. 07/612,671 (caged biotin CIP).
E. Detection Methods and Apparatus

An appropriate detection method applicable to the selected labeling method can be selected. Suitable labels
include radionucleotides, enzymes, substrates, cofactors, inhibitors, magnetic particles, heavy metal atoms, and
particularly fluorescers, chemiluminescers, and spectroscopic labels. Patents teaching the use of such labels include U.S.
Patent Nos. 3,817,837; 3,850,752; 3,939,350; 3,996,345; 4,277,437; 4,275,149; and 4,366,241.

With an appropriate label selected, the detection system best adapted for high resolution and high sensitivity
detection may be selected. As indicated above, an optically detectable system, e.g., fluorescence or chemiluminescence,
would be preferred. Other detection systems may be adapted to the purpose, e.g., electron microscopy, scanning
electron microscopy (SEM), scanning tunneling electron microscopy (STEM), infrared microscopy, atomic force microscopy
(AFM), electrical conductance, and image plate transfer.

With a detection method selected, an apparatus for scanning the substrate will be designed. Apparatus, as
described in PCT publication no. WO90/15070, published December 13, 1990; or U.S.S.N. 07/624,120, filed December
6, 1990, are particularly appropriate. Design modifications may also be incorporated therein.

F. Data Analysis 

Data is analyzed by processes similar to those described below in the section describing theoretical analysis. More
efficient algorithms will be mathematically devised, and will usually be designed to be performed on a computer. Various
computer programs which may more quickly or efficiently make measurement samples and distinguish signal from
noise will also be devised. See, particularly, U.S.S.N. 07/624,120.

The initial data resulting from the detection system is an array of data indicative of fluorescent intensity versus
location on the substrate. The data are typically taken over regions substantially smaller than the area in which synthesis of
a given polymer has taken place. Merely by way of example, if polymers were synthesized in squares on the substrate
having dimensions of 500 microns by 500 microns, the data may be taken over regions having dimensions of 5 microns
by 5 microns. In most preferred embodiments, the regions over which fluorescence data are taken across the substrate
are less than about 1/2 the area of the regions in which individual polymers are synthesized, preferably less than 1/10
the area in which a single polymer is synthesized, and most preferably less than 1/100 the area in which a single
polymer is synthesized. Hence, within any area in which a given polymer has been synthesized, a large number of
fluorescence data points are collected.

A plot of number of pixels versus intensity for a scan should bear a rough resemblance to a bell curve, but spurious
data are observed, particularly at higher intensities. Since it is desirable to use an average of fluorescent intensity over
a given synthesis region in determining relative binding affinity, these spurious data will tend to undesirably skew the
data.

Accordingly, in one embodiment of the invention the data are corrected for removal of these spurious data points,
and an average of the data points is thereafter utilized in determining relative binding efficiency. In general the data are
fitted to a base curve and statistical measures are used to remove spurious data.
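
The correction described above can be sketched in a few lines. The text does not specify the base curve or the statistical measure, so the following is only an illustrative assumption: it treats the pixel intensities of one synthesis region as roughly bell shaped, rejects points lying far from the median (using the median absolute deviation), and averages the rest. The function name `robust_region_mean` and the `cutoff` parameter are hypothetical.

```python
from statistics import mean, median

def robust_region_mean(intensities, cutoff=3.0):
    """Average the pixel intensities of one region after dropping outliers.

    Assumption: outliers are points more than `cutoff` scaled MADs from the median.
    """
    med = median(intensities)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(x - med) for x in intensities)
    if mad == 0:
        return mean(intensities)
    # 1.4826 rescales the MAD to be comparable to a standard deviation
    # when the underlying data are Gaussian.
    limit = cutoff * 1.4826 * mad
    kept = [x for x in intensities if abs(x - med) <= limit]
    return mean(kept)

pixels = [101, 98, 103, 99, 100, 102, 97, 950]  # one spurious bright pixel
print(mean(pixels))                # skewed upward by the spurious point
print(robust_region_mean(pixels))  # the spurious point is excluded
```

A plain average of these eight pixels is pulled far above the typical value by the single bright point, which is the skew the text refers to.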

In an additional analytical tool, various degeneracy reducing analogues may be incorporated in the hybridization
probes. Various aspects of this strategy are described, e.g., in Macevicz, S. (1990) PCT publication number WO
90/04652, which is hereby incorporated herein by reference.

II. THEORETICAL ANALYSIS 

The principle of the hybridization sequencing procedure is based, in part, upon the ability to determine overlaps of
short segments. The VLSIPS technology provides the ability to generate reagents which will saturate the possible short
subsequence recognition possibilities. The principle is most easily illustrated by using a binary sequence, such as a
sequence of zeros and ones. Once having illustrated the application to a binary alphabet, the principle may easily be
understood to encompass three letter, four letter, five or more letter, even 20 letter alphabets. A theoretical treatment of
analysis of subsequence information to reconstruction of a target sequence is provided, e.g., in Lysov, Yu., et al. (1988)
Doklady Akademi. Nauk SSR 303:1508-1511; Khrapko, K., et al. (1989) FEBS Letters 256:118-122; Pevzner, P. (1989)
J. of Biomolecular Structure and Dynamics 7:63-69; and Drmanac, R. et al. (1989) Genomics 4:114-128; each of which
is hereby incorporated herein by reference.

The reagents for recognizing the subsequences will usually be specific for recognizing a particular polymer
subsequence anywhere within a target polymer. It is preferable that conditions may be devised which allow absolute
discrimination between high fidelity matching and very low levels of mismatching. The reagent interaction will preferably exhibit
no sensitivity to flanking sequences, to the subsequence position within the target, or to any other remote structure
within the sequence.

A. Simple n-mer Structure: Theory

1. Simple two letter alphabet: example

A simple example is presented below of how a sequence of ten digits comprising zeros and ones would be
sequenceable using short segments of five digits. For example, consider the sample ten digit sequence:
1010011100.

A VLSIPS substrate could be constructed, as discussed elsewhere, which would have reagents attached in a defined
matrix pattern which specifically recognize each of the possible five digit sequences of ones and zeros. The number of
possible five digit subsequences is 2^5 = 32. The number of possible different sequences 10 digits long is 2^10 = 1,024.
The five contiguous digit subsequences within a ten digit sequence number six, i.e., positioned at digits 1-5, 2-6, 3-7,
4-8, 5-9, and 6-10. It will be noted that the specific order of the digits in the sequence is important and that the order is
directional, e.g., running left to right versus right to left. The first five digit sequence contained in the target sequence is
10100. The second is 01001, the third is 10011, the fourth is 00111, the fifth is 01110, and the sixth is 11100.
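
The enumeration of contiguous subsequences can be checked with a short sketch. The helper name `kmers` is an illustrative choice, not something defined in the specification.

```python
def kmers(sequence, k):
    """Return every contiguous k-character subsequence, in left-to-right order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

target = "1010011100"
fragments = kmers(target, 5)
print(fragments)        # the six 5-mers at positions 1-5 through 6-10
print(2 ** 5, 2 ** 10)  # 32 possible 5-mers, 1024 possible 10-digit sequences
```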

The VLSIPS substrate would have a matrix pattern of positionally attached reagents which recognize each of the
different 5-mer subsequences. Those reagents which recognize each of the 6 contained 5-mers will bind the target and
a label allows the positional determination of where the sequence specific interaction has occurred. By correlation of
the position in the matrix pattern, the corresponding bound subsequences can be determined.

In the above-mentioned sequence, six different 5-mer sequences would be determined to be present. They would
be:

10100
01001
10011
00111
01110
11100


Any sequence which contains the first five digit sequence, 10100, already narrows the number of possible
sequences (e.g., from 1024 possible sequences) which contain it to less than about 192 possible sequences.

This 192 is derived from the observation that with the subsequence 10100 at the far left of the sequence, in
positions 1-5, there are only 32 possible sequences. Likewise for that particular subsequence in positions 2-6, 3-7, 4-8,
5-9, and 6-10. So, to sum up all of the sequences that could contain 10100, there are 32 for each position and 6 positions
for a total of about 192 possible sequences. However, some of these 10 digit sequences will have been counted twice.
Thus, by virtue of containing the 10100 subsequence, the number of possible 10-mer sequences has been decreased
from 1024 sequences to less than about 192 sequences.
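
The bound of 192 can be compared with an exact count by brute force. The sketch below simply enumerates all 1,024 ten digit binary sequences; the function name `count_containing` is hypothetical.

```python
from itertools import product

def count_containing(probe, length, alphabet="01"):
    """Count the sequences of the given length that contain `probe` somewhere."""
    return sum(probe in "".join(chars)
               for chars in product(alphabet, repeat=length))

upper_bound = 6 * 2 ** 5                 # 6 positions x 32 completions = 192
exact = count_containing("10100", 10)
print(upper_bound, exact)
```

Because 10100 cannot overlap a second copy of itself, the only ten digit sequence counted twice in the bound is 1010010100, so the exact count falls one short of 192.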

In this example, not only do we know that the sequence contains 10100, but we also know that it contains the second
five character sequence, 01001. By virtue of knowing that the sequence contains 10100, we can look specifically to
determine whether the sequence contains a subsequence of five characters which contains the four leftmost digits plus
a next digit to the left. For example, we would look for a sequence of X1010, but we find that there is none. Thus, we
know that the 10100 must be at the left end of the 10-mer. We would also look to see whether the sequence contains
the rightmost four digits plus a next digit to the right, e.g., 0100X. We find that the sequence also contains the sequence
01001, and that X is a 1. Thus, we know at least that our target sequence has an overlap of 0100 and has the left
terminal sequence 101001.

Applying the same procedure to the second 5-mer, we also know that the sequence must include a sequence of
five digits having the sequence 1001Y where Y must be either 0 or 1. We look through the fragments and we see that
we have a 10011 sequence within our target, thus Y is also 1. Thus, we would know that our sequence has a sequence
of the first seven being 1010011.

Moving to the next 5-mer, we know that there must be a sequence of 0011Z, where Z must be either 0 or 1. We look
at the fragments produced above and see that the target sequence contains a 00111 subsequence and Z is 1. Thus,
we know the sequence must start with 10100111.

The next 5-mer must be of the sequence 0111W where W must be 0 or 1. Again, looking up at the fragments
produced, we see that the target sequence contains a 01110 subsequence, and W is a 0. Thus, our sequence to this point
is 101001110. We know that the last 5-mer must be either 11100 or 11101. Looking above, we see that it is 11100 and
that must be the last of our sequence. Thus, we have determined that our sequence must have been 1010011100.
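
The stepwise reasoning above can be written as a small procedure. This is a minimal sketch under the assumption that every overlap is unique, as it is in the example; the function name `reconstruct` is illustrative, and the sketch simply raises an error when it meets an ambiguity rather than attempting to resolve it.

```python
def reconstruct(fragments):
    """Rebuild a sequence from its set of k-mers by following unique overlaps."""
    frags = set(fragments)
    k = len(next(iter(frags)))
    alphabet = sorted({ch for f in frags for ch in f})
    # The leftmost fragment is the one that no other fragment can precede,
    # i.e. no fragment of the form X + (its first k-1 characters) was observed.
    starts = [f for f in frags
              if not any(x + f[:-1] in frags for x in alphabet)]
    if len(starts) != 1:
        raise ValueError("no unique starting fragment")
    sequence = starts[0]
    used = {starts[0]}
    while True:
        tail = sequence[-(k - 1):]
        # Candidate next fragments share the last k-1 characters and add one more.
        nxt = [tail + x for x in alphabet
               if tail + x in frags and tail + x not in used]
        if not nxt:
            return sequence
        if len(nxt) > 1:
            raise ValueError("ambiguous overlap at " + tail)
        sequence += nxt[0][-1]
        used.add(nxt[0])

five_mers = ["10100", "01001", "10011", "00111", "01110", "11100"]
print(reconstruct(five_mers))  # 1010011100
```

The order in which the fragments are supplied does not matter, since only the set of observed k-mers is used.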

However, it will be recognized from the example above with the sequences provided therein, that the sequence
analysis can start with any known positive probe subsequence. The determination may be performed by moving linearly
along the sequence checking the known sequence with a limited number of next positions. Given this possibility, the
sequence may be determined, besides by scanning all possible oligonucleotide probe positions, by specifically looking
only where the next possible positions would be. This may increase the complexity of the scanning but may provide a
longer time span dedicated towards scanning and detecting specific positions of interest relative to other sequence
possibilities. Thus, the scanning apparatus could be set up to work its way along a sequence from a given contained
oligonucleotide to only look at those positions on the substrate which are expected to have a positive signal.

It is seen that given a sequence, it can be deconstructed into n-mers to produce a set of internal contiguous
subsequences. From any given target sequence, we would be able to determine what fragments would result. The
hybridization sequence method depends, in part, upon being able to work in the reverse, from a set of fragments of known
sequences to the full sequence. In simple cases, one is able to start at a single position and work in either or both
directions towards the ends of the sequence as illustrated in the example.

The number of possible sequences of a given length increases very quickly with the length of that sequence. Thus,
a 10-mer of zeros and ones has 1024 possibilities, a 12-mer has 4096. A 20-mer has over a million possibilities, and a
30-mer has over a billion. However, a given 30-mer has, at most, 26 different internal 5-mer sequences. Thus, a 30
character target sequence having over a million possible sequences can be substantially defined by only 26 different
5-mers. It will be recognized that the probe oligonucleotides will preferably, but need not necessarily, be of identical length,
and that the probe sequences need not necessarily be contiguous in that the overlapping subsequences need not differ
by only a single subunit. Moreover, each position of the matrix pattern need not be homogeneous, but may actually
contain a plurality of probes of known sequence. In addition, although all of the possible subsequence specifications would
be preferred, a less than full set of sequence specifications could be used. In particular, although a substantial fraction
will preferably be at least about 70%, it may be less than that. About 20% would be preferred, more preferably at least
about 30% would be desired. Higher percentages would be especially preferred.

2. Example of four letter alphabet

A four letter alphabet may be conceptualized in at least two different ways from the two letter alphabet. One way is
to consider the four possible values at each position and to analogize in a similar fashion to the binary example each of
the overlaps. A second way is to group the binary digits into groups.

Using the first means, the overlap comparisons are performed with a four letter alphabet rather than a two letter
alphabet. Then, in contrast to the binary system with 10 positions where there are 2^10 = 1,024 possible sequences, in a
4-character alphabet with 10 positions, there will actually be 4^10 = 1,048,576 possible sequences. Thus, a
four character sequence has a much larger number of possible sequences compared to a two character sequence.
Note, however, that there are still only 6 different internal 5-mers. For simplicity, we shall examine a 5 character string
with 3 character subsequences. Instead of only 1 and 0, the characters may be designated, e.g., A, C, G, and T. Let us
take the sequence GGCTA. The 3-mer subsequences are:

GGC
GCT
CTA

Given these subsequences, there is one sequence, or at most only a few sequences, which would produce that
combination of subsequences, i.e., GGCTA.

Alternatively, with a four character universe, the binary system can be looked at in pairs of digits. The pairs would
be 00, 01, 10, and 11. In this manner, the earlier used sequence 1010011100 is looked at as 10,10,01,11,00. Then the
first character of two digits is selected from the possible universe of the four representations 00, 01, 10, and 11. Then
a probe would be in an even number of digits, e.g., not five digits, but three pairs of digits or six digits. A similar
comparison is performed and the possible overlaps determined. The 3-pair subsequences are:

10,10,01
10,01,11
01,11,00

and the overlap reconstruction produces 10,10,01,11,00.

The latter of the two conceptual views of the 4 letter alphabet provides a representation which is similar to what
would be provided in a digital computer. The applicability to a four nucleotide alphabet is easily seen by assigning, e.g.,
00 to A, 01 to C, 10 to G, and 11 to T. And, in fact, if such a correspondence is used, both examples for the 4 character
sequences can be seen to represent the same target sequence. The applicability of the hybridization method and its
analysis for determining the ultimate sequence is easily seen if A is the representation of adenine, C is the
representation of cytosine, G is the representation of guanine, and T is the representation of thymine or uracil.
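
The correspondence between digit pairs and nucleotides can be illustrated directly. The sketch below uses the assignment given in the text (00 to A, 01 to C, 10 to G, 11 to T); the function names are illustrative only.

```python
PAIR_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_PAIR = {base: pair for pair, base in PAIR_TO_BASE.items()}

def bits_to_bases(bits):
    """Read a binary string two digits at a time and map each pair to a base."""
    if len(bits) % 2:
        raise ValueError("need an even number of binary digits")
    return "".join(PAIR_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))

def bases_to_bits(bases):
    """Inverse mapping: each base becomes its two-digit binary code."""
    return "".join(BASE_TO_PAIR[b] for b in bases)

print(bits_to_bases("1010011100"))  # GGCTA
print(bases_to_bits("GGCTA"))       # 1010011100
```

Under this assignment the binary example 1010011100 and the four letter example GGCTA are the same target sequence, as the text states.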

B. Complications 

Two obvious complications exist with the method of sequence analysis by hybridization. The first results from a
probe of inappropriate length while the second relates to internally repeated sequences.

The first obvious complication is a problem which arises from an inappropriate length of recognition sequence,
which causes problems with the specificity of recognition. For example, if the recognized sequence is too short, every
sequence which is utilized will be recognized by every probe sequence. This occurs, e.g., in a binary system where the
probes are each of sequences which occur relatively frequently, e.g., a two character probe for the binary system. Each
possible two character probe would be expected to appear 1/4 of the time in every single two character position. Thus,
the above sequence example would be recognized by each of the 00, 10, 01, and 11. Thus, the sequence information
is virtually lost because the resolution is too low and each recognition reagent specifically binds at multiple sites on the
target sequence.
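
The loss of resolution with very short probes can be seen by counting how often each probe matches the example target. This is a small sketch; `probe_hits` is a hypothetical helper name.

```python
from collections import Counter

def probe_hits(target, k):
    """Count how many positions in `target` each k-character probe would match."""
    return Counter(target[i:i + k] for i in range(len(target) - k + 1))

target = "1010011100"
two_mer_hits = probe_hits(target, 2)
five_mer_hits = probe_hits(target, 5)
print(dict(two_mer_hits))             # all four 2-mers match, each at several sites
print(max(five_mer_hits.values()))    # each 5-mer matches at exactly one site
```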

The number of different probes which bind to a target depends on the relationship between the probe length and
the target length. At the extreme of short probe length, the just mentioned problem exists of excessive redundancy and
lack of resolution. The lack of stability in recognition will also be a problem with extremely short probes. At the extreme
of long probe length, each entire probe sequence is on a different position of a substrate. However, a problem arises
from the number of possible sequences, which goes up dramatically with the length of the sequence. Also, the
specificity of recognition begins to decrease as the contribution to binding by any particular subunit may become sufficiently
low that the system fails to distinguish the fidelity of recognition. Mismatched hybridization may be a problem with the
polynucleotide sequencing applications, though the fingerprinting and mapping applications may not be so strict in their
fidelity requirements. As indicated above, a thirty position binary sequence has over a million possible sequences, a
number which starts to become unreasonably large in its required number of different sequences, even though the
target length is still very short. Preparing a substrate with all sequence possibilities for a long target may be extremely
difficult due to the many different oligomers which must be synthesized.

The above example illustrates how a long target sequence may be reconstructed with a reasonably small number
of shorter subsequences. Since the present day resolution of the regions of the substrate having defined oligomer
probes attached to the substrate approaches about 10 microns by 10 microns for resolvable regions, about 10^6, or 1
million, positions can be placed on a one centimeter square substrate. However, high resolution systems may have
particular disadvantages which may be outweighed using the lower density substrate matrix pattern. For this reason, a
sufficiently large number of probe sequences can be utilized so that any given target sequence may be determined by
hybridization to a relatively small number of probes.
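
The figure of about one million positions follows from simple geometry, sketched below. The function name `regions_per_cm2` is illustrative, and the calculation assumes square regions tiled edge to edge with no spacing between them.

```python
def regions_per_cm2(side_um):
    """Number of square regions of the given side (in microns) in one square centimeter."""
    per_side = 10_000 // side_um  # 1 cm = 10,000 microns
    return per_side ** 2

print(regions_per_cm2(10))   # 1000000 regions at 10 x 10 microns
print(regions_per_cm2(100))  # 10000 regions at 100 x 100 microns
```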

A second complication relates to convergence of sequences to a single subsequence. This will occur when a
particular subsequence is repeated in the target sequence. This problem can be addressed in at least two different ways.
The first, and simpler, way is to separate the repeat sequences onto two different targets. Thus, each single target will
not have the repeated sequence and can be analyzed to its end. This solution, however, complicates the analysis by
requiring that some means for cutting at a site between the repeats be available. One would typically
want to have two intermediate cut points so that the intermediate region can also be sequenced in both directions
across each of the cut points. This problem is inherent in the hybridization method for sequencing but can be minimized
by using a longer known probe sequence so that the frequency of probe repeats is decreased.
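
Whether a target will cause convergence at a given probe length can be checked by looking for repeated overlap regions. The sketch below is an illustration only; it assumes that probes of length k overlap by k-1 characters, and `repeated_overlaps` is a hypothetical helper name.

```python
from collections import Counter

def repeated_overlaps(target, k):
    """Return the (k-1)-character overlaps that occur more than once in `target`.

    A repeated overlap means two different k-mers can follow the same context,
    so reconstruction from k-mers may converge or branch at that point.
    """
    counts = Counter(target[i:i + k - 1] for i in range(len(target) - k + 2))
    return sorted(s for s, n in counts.items() if n > 1)

target = "1010011100"
print(repeated_overlaps(target, 4))  # ['100'] : 4-mer probes meet a repeat
print(repeated_overlaps(target, 5))  # []      : 5-mer probes do not
```

For this target, lengthening the probe from four to five characters removes the repeat, which is the effect of longer probes described above.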

Knowing the sequence of flanking sequences of the repeat will simplify the use of polymerase chain reaction (PCR)
or a similar technique to further definitively determine the sequence between sequence repeats. Probes can be made
to hybridize to those known sequences adjacent the repeat sequences, thereby producing new target sequences for
analysis. See, e.g., Innis et al. (eds.) (1990) PCR Protocols: A Guide to Methods and Applications, Academic Press;
and methods for synthesis of oligonucleotide probes, see, e.g., Gait (1984) Oligonucleotide Synthesis: A Practical
Approach, IRL Press, Oxford.

Other means for dealing with convergence problems include using particular longer probes, and using degeneracy
reducing analogues, see, e.g., Macevicz, S. (1990) PCT publication number WO 90/04652, which is hereby
incorporated herein by reference. By use of stretches of the degeneracy reducing analogues with other probes in particular
combinations, the number of probes necessary to fully saturate the possible oligomer probes is decreased. For
example, with a stretch of 12-mers having the central 4-mer of degenerate nucleotides, in combination with all of the possible
8-mers, the collection numbers twice the number of possible 8-mers, e.g., 65,536 + 65,536 = 131,072, but the population
provides screening equivalent to all possible 12-mers.

By way of further explanation, all possible oligonucleotide 8-mers may be depicted in the fashion:
N1-N2-N3-N4-N5-N6-N7-N8,

in which there are 4^8 = 65,536 possible 8-mers. As described in U.S.S.N. 07/624,120, producing all possible 8-mers
requires 4 x 8 = 32 chemical binary synthesis steps to produce the entire matrix pattern of 65,536 8-mer possibilities. By
incorporating degeneracy reducing nucleotides, D's, which hybridize nonselectively to any corresponding
complementary nucleotide, new oligonucleotide 12-mers can be made in the fashion:

N1-N2-N3-N4-D-D-D-D-N5-N6-N7-N8,
in which there are again, as above, only 4^8 = 65,536 possible "12-mers", which in reality only have 8 different
nucleotides.

However, it can be seen that each possible 12-mer probe could be represented by a group of the two 8-mer types.
Moreover, repeats of less than 12 nucleotides would not converge, or cause repeat problems in the analysis. Thus,
instead of requiring a collection of probes corresponding to all 12-mers, or 4^12 = 16,777,216 different 12-mers, the same
information can be derived by making 2 sets of "8-mers" consisting of the typical 8-mer collection of 4^8 = 65,536 and
the "12-mer" set with the degeneracy reducing analogues, also requiring making 4^8 = 65,536. The combination of the
two sets requires making 65,536 + 65,536 = 131,072 different molecules, but gives the information of 16,777,216
molecules. Thus, incorporating the degeneracy reducing analogue decreases the number of molecules necessary to get
12-mer resolution by a factor of about 128-fold.
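
The arithmetic in this passage can be verified in a few lines. The variable names are illustrative only.

```python
n_8mers = 4 ** 8             # all ordinary 8-mers
n_gapped = 4 ** 8            # N4-DDDD-N4 "12-mers": only 8 specified positions
n_12mers = 4 ** 12           # all ordinary 12-mers
combined = n_8mers + n_gapped
reduction = n_12mers // combined
synthesis_steps = 4 * 8      # 4 nucleotides x 8 positions for the 8-mer set

print(n_8mers, combined, n_12mers)  # 65536 131072 16777216
print(reduction)                    # 128
print(synthesis_steps)              # 32
```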

III. POLYNUCLEOTIDE SEQUENCING 


In principle, the making of a substrate having a positionally defined matrix pattern of all possible oligonucleotides
of a given length involves a conceptually simple method of synthesizing each and every different possible
oligonucleotide, and affixing it to a definable position. Oligonucleotide synthesis is presently mechanized and enabled by current
technology, see, e.g., U.S.S.N. 07/362,901 (VLSIPS parent); U.S.S.N. 07/492,462 (VLSIPS CIP); and instruments
supplied by Applied Biosystems, Foster City, California.

A. Preparation of Substrate Matrix

The collection of specific oligonucleotides used in polynucleotide sequencing may be produced
in at least two different ways. Present technology certainly allows production of ten nucleotide oligomers on a solid
phase or other synthesizing system. See, e.g., instrumentation provided by Applied Biosystems, Foster City, California.
Although a single oligonucleotide can be relatively easily made, a large collection of them would typically require a fairly
large amount of time and investment. For example, there are 4^10 = 1,048,576 possible ten nucleotide oligomers.
Present technology allows making each and every one of them in a separate purified form, though such might be costly
and laborious.

Once the desired repertoire of possible oligomer sequences of a given length has been synthesized, this
collection of reagents may be individually positionally attached to a substrate, thereby allowing a batchwise hybridization step.
Present technology also would allow the possibility of attaching each and every one of these 10-mers to a separate
specific position on a solid matrix. This attachment could be automated in any of a number of ways, particularly use of a
caged biotin type linking. This would produce a matrix having each of the different possible 10-mers.

A batchwise hybridization is much preferred because of its reproducibility and simplicity. An automated process of
attaching various reagents to positionally defined sites on a substrate is provided in PCT publication no. WO90/15070;
U.S.S.N. 07/624,120; and PCT publication no. WO91/07087; each of which is hereby incorporated herein by reference.

Instead of separate synthesis of each oligonucleotide, these oligonucleotides are conveniently synthesized in
parallel by sequential synthetic processes on a defined matrix pattern as provided in PCT publication no. WO90/15070;
and U.S.S.N. 07/624,120, which are incorporated herein by reference. Here, the oligonucleotides are synthesized
stepwise on a substrate at positionally separate and defined positions. Use of photosensitive blocking reagents allows for
defined sequences of synthetic steps over the surface of a matrix pattern. By use of the binary masking strategy, the
surface of the substrate can be positioned to generate a desired pattern of regions, each having a defined sequence
oligonucleotide synthesized and immobilized thereto.

Although the prior art technology can be used to generate the desired repertoire of oligonucleotide probes, an
efficient and cost effective means would be to use the VLSIPS technology described in PCT publication no. WO90/15070
and U.S.S.N. 07/624,120. In this embodiment, the photosensitive reagents involved in the production of such a matrix
are described below.

The regions for synthesis may be very small, usually less than about 100 µm x 100 µm, more usually less than
about 50 µm x 50 µm. The photolithography technology allows synthetic regions of less than about 10 µm x 10 µm,
about 3 µm x 3 µm, or less. The detection also may detect such sized regions, though larger areas are more easily and
reliably measured.







EP0834 575A2 



At a size of about 30 microns by 30 microns, one million regions would take about 11 centimeters square or a single
wafer of about 4 centimeters by 4 centimeters. Thus the present technology provides for making a single matrix of that
size having all one million plus possible oligonucleotides. Region sizes are sufficiently small to correspond to densities
of at least about 5 regions/cm^2, 20 regions/cm^2, 50 regions/cm^2, 100 regions/cm^2, and greater, including 300
regions/cm^2, 1000 regions/cm^2, 3K regions/cm^2, 10K regions/cm^2, 30K regions/cm^2, 100K regions/cm^2, 300K
regions/cm^2 or more, even in excess of one million regions/cm^2.

Although the pattern of the regions which contain specific sequences is theoretically not important, for practical
reasons certain patterns will be preferred in synthesizing the oligonucleotides. The application of binary masking
algorithms for generating the pattern of known oligonucleotide probes is described in related U.S.S.N. 07/624,120. By use
of these binary masks, a highly efficient means is provided for producing the substrate with the desired matrix pattern
of different sequences. Although the binary masking strategy allows for the synthesis of all lengths of polymers, the
strategy may be easily modified to provide only polymers of a given length. This is achieved by omitting steps where a
subunit is not attached.

The strategy for generating a specific pattern may take any of a number of different approaches. These approaches
are well described in related application U.S.S.N. 07/624,120 and include a number of binary masking approaches
which will not be exhaustively discussed herein. However, the binary masking and binary synthesis approaches provide
a maximum of diversity with a minimum number of actual synthetic steps.

The length of oligonucleotides used in sequencing applications will be selected on criteria determined to some
extent by the practical limits discussed above. For example, if probes are made as oligonucleotides, there will be 65,536
possible eight nucleotide sequences. If a nine subunit oligonucleotide is selected, there are 262,144 possible
permutations of sequences. If a ten-mer oligonucleotide is selected, there are 1,048,576 possible permutations of sequences.
As the number gets larger, the required number of positionally defined subunits necessary to saturate the possibilities
also increases. With respect to hybridization conditions, the length of the matching necessary to confer stability of
the conditions selected can be compensated for. See, e.g., Kanehisa, M. (1984) Nuc. Acids Res. 12:203-213, which is
hereby incorporated herein by reference.

Although described in detail not here but below for oligonucleotide probes, the VLSIPS technology would typically
use a photosensitive protective group on an oligonucleotide. Sample oligonucleotides are shown in Figure 1. In
particular, the photoprotective group on the nucleotide molecules may be selected from a wide variety of positive light reactive
groups, preferably including nitro aromatic compounds such as o-nitrobenzyl derivatives or benzylsulfonyl. See, e.g.,
Gait (1984) Oligonucleotide Synthesis: A Practical Approach, IRL Press, Oxford, which is hereby incorporated herein
by reference. In a preferred embodiment, 6-nitroveratryloxycarbonyl (NVOC), 2-nitrobenzyloxycarbonyl (NBOC), or
α,α-dimethyl-dimethoxybenzyloxycarbonyl (DDZ) is used. Photoremovable protective groups are described in, e.g.,
Patchornik (1970) J. Amer. Chem. Soc. 92:6333; and Amit et al. (1974) J. Organic Chem. 39:192; each of which is
hereby incorporated herein by reference.

A preferred linker for attaching the oligonucleotide to a silicon matrix is illustrated in Figure 2. A more detailed
description is provided below. A photosensitive blocked nucleotide may be attached to specific locations of unblocked
prior cycles of attachments on the substrate and can be successively built up to the correct length oligonucleotide
probe.

It should be noted that multiple substrates may be simultaneously exposed to a single target sequence where each
substrate is a duplicate of one another or where, in combination, multiple substrates together provide the complete or
desired subset of possible subsequences. This provides the opportunity to overcome a limitation of the density of
positions on a single substrate by using multiple substrates. In the extreme case, each probe might be attached to a single
bead or substrate and the beads sorted by whether there is a binding interaction. Those beads which do bind might be
encoded to indicate the subsequence specificity of reagents attached thereto.

Then, the target may be bound to the whole collection of beads and those beads that have appropriate specific
reagents on them will bind to target. Then a sorting system may be utilized to sort those beads that actually bind the target
from those that do not. This may be accomplished by presently available cell sorting devices or a similar apparatus.
After the relatively small number of beads which have bound the target have been collected, the encoding scheme may
be read off to determine the specificity of the reagent on the bead. An encoding system may include a magnetic system,
a shape encoding system, a color encoding system, or a combination of any of these, or any other encoding system.
Once again, with the collection of specific interactions that have occurred, the binding may be analyzed for sequence
information, fingerprint information, or mapping information.

The parameters of polynucleotide sizes of both the probes and target sequences are determined by the
applications and other circumstances. The length of the oligonucleotide probes used will depend in part upon the limitations of
the VLSIPS technology to provide the number of desired probes. For example, in an absolute sequencing application,
it is often useful to have virtually all of the possible oligonucleotides of a given length. As indicated above, there are
65,536 8-mers, 262,144 9-mers, 1,048,576 10-mers, 4,194,304 11-mers, etc. As the length of the oligomer increases,
the number of different probes which must be synthesized also increases at a rate of a factor of 4 for every additional
nucleotide. Eventually the size of the matrix and the limitations in the resolution of regions in the matrix will reach the
point where an increase in number of probes becomes disadvantageous. However, this sequencing procedure requires
that the system be able to distinguish, by appropriate selection of hybridization and washing conditions, between
binding of absolute fidelity and binding of complementary sequences containing mismatches. On the other hand, if the
fidelity is unnecessary, this discrimination is also unnecessary and a significantly longer probe may be used. Significantly
longer probes would typically be useful in fingerprinting or mapping applications.
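
The growth in the number of probes with probe length can be tabulated with a short sketch. The function name `probe_count` is illustrative.

```python
def probe_count(length, alphabet_size=4):
    """Number of distinct oligomers of the given length over the alphabet."""
    return alphabet_size ** length

for n in range(8, 12):
    # Each additional nucleotide multiplies the count by 4.
    print(f"{n}-mers: {probe_count(n):,}")
```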

The length of the probe is selected for a length that it will bind with specificity to possible targets. The hybridization
conditions are also very important in that they will determine how close the homology of complementary binding will be
detected. In fact, a single target may be evaluated at a number of different conditions to determine its spectrum of
specificity for binding particular probes. This may find use in a number of other applications besides the polynucleotide
sequencing, fingerprinting, or mapping. In a related fashion, different regions with reagents having differing affinities or
levels of specificity may allow such a spectrum to be defined using a single incubation, where various regions, at a given
hybridization condition, show the binding affinity. For example, fingerprint probes of various lengths, or with specific
defined non-matches, may be used. Unnatural nucleotides or nucleotides exhibiting modified specificity of
complementary binding are described in greater detail in Macevicz (1990) PCT pub. No. WO 90/04652; and see the section on
modified nucleotides in the Sigma Chemical Company catalogue.

B. Labeling Target Nucleotide 

The label used to detect the target sequences will be determined, in part, by the detection methods being applied.
Thus, the labeling method and label used are selected in combination with the actual detecting systems being used.

Once a particular label has been selected, appropriate labeling protocols will be applied, as described below for
specific embodiments. Standard labeling protocols for nucleic acids are described, e.g., in Sambrook et al.; Kambara,
H. et al. (1988) BioTechnology 6:816-821; Smith, L. et al. (1985) Nuc. Acids Res. 13:2399-2412; for polypeptides, see,
e.g., Allen, G. (1989) Sequencing of Proteins and Peptides, Elsevier, New York, especially chapter 5, and Greenstein
and Winitz (1961) Chemistry of the Amino Acids, Wiley and Sons, New York. Carbohydrate labeling is described, e.g.,
in Chaplin and Kennedy (1986) Carbohydrate Analysis: A Practical Approach, IRL Press, Oxford. Labeling of other
polymers will be performed by methods applicable to them as recognized by a person having ordinary skill in
manipulating the corresponding polymer.

In some embodiments, the target need not actually be labeled if a means for detecting where interaction takes
place is available. As described below, for a nucleic acid embodiment, such may be provided by an intercalating dye
which intercalates only into double stranded segments, e.g., where interaction occurs. See, e.g., Sheldon et al. U.S.
Pat. No. 4,582,789.

In many uses, the target sequence will be absolutely homogeneous, both with respect to the total sequence and
with respect to the ends of each molecule. Homogeneity with respect to sequence is important to avoid ambiguity. It is
preferable that the target sequences of interest not be contaminated with a significant amount of labeled contaminating
sequences. The extent of allowable contamination will depend on the sensitivity of the detection system and the
inherent signal to noise of the system. Homogeneous contamination sequences will be particularly disruptive of the
sequencing procedure.

However, although the target polynucleotide must have a unique sequence, the target molecules need not have
identical ends. In fact, the homogeneous target molecule preparation may be randomly sheared to increase the
number of molecules. Since the total information content remains the same, the shearing results only in a higher
number of distinct sequences which may be labeled and bind to the probe. This fragmentation may give a vastly
superior signal relative to a preparation of the target molecules having homogeneous ends. The signal for the
hybridization is likely to be dependent on the numerical frequency of the target-probe interactions. If a sequence is
individually found on a larger number of separate molecules, a better signal will result. In fact, shearing a homogeneous
preparation of the target may often be preferred before the labeling procedure is performed, thereby producing a large
number of labeling groups associated with each subsequence.

C. Hybridization Conditions

The hybridization conditions between probe and target should be selected such that the specific recognition
interaction, i.e., hybridization, of the two molecules is both sufficiently specific and sufficiently stable. See, e.g., Hames
and Higgins (1985) Nucleic Acid Hybridisation: A Practical Approach, IRL Press, Oxford. These conditions will be
dependent both on the specific sequence and often on the guanine and cytosine (GC) content of the complementary
hybrid strands. The conditions may often be selected so that hybrids are equally stable regardless of the specific
sequences involved. This typically will make use of a reagent such as an arylammonium buffer. See, Wood et al. (1985)
"Base Composition-Independent Hybridization in Tetramethylammonium Chloride: A Method for Oligonucleotide
Screening of Highly Complex Gene Libraries," Proc. Natl. Acad. Sci. USA, 82:1585-1588; and Krupov et al. (1989)
"An Oligonucleotide Hybridization Approach to DNA Sequencing," FEBS Letters, 256:118-122; each of which is hereby
incorporated herein by reference. An arylammonium buffer tends to minimize differences in hybridization rate and
stability due to GC content. Because sequences then hybridize with approximately equal affinity and stability, there is
relatively little bias in strength or kinetics of binding for particular sequences. Temperature and salt conditions, along
with other buffer parameters, should be selected such that the kinetics of renaturation are essentially independent of
the specific target subsequence or oligonucleotide probe involved. In order to ensure this, the hybridization reactions
will usually be performed in a single incubation, with all the substrate matrices exposed together to the same target
probe solution under the same conditions.

Alternatively, various substrates may be individually treated differently. Different substrates may be produced, each
having reagents which bind to target subsequences with substantially identical stabilities and kinetics of hybridization.
For example, all of the high GC content probes could be synthesized on a single substrate which is treated accordingly.
In this embodiment, the arylammonium buffers could be unnecessary. Each substrate is then treated in a manner such
that the collection of substrates shows essentially uniform binding, and the hybridization data of target binding to the
individual substrate matrix is combined with the data from other substrates to derive the necessary subsequence
binding information. The hybridization conditions will usually be selected to be sufficiently specific that the fidelity of
base matching will be properly discriminated. Of course, control hybridizations should be included to determine the
stringency and kinetics of hybridization.
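
The idea of placing probes of similar GC content on the same substrate can be illustrated as follows. This is a minimal sketch only: the function names `gc_fraction` and `group_by_gc`, the choice of equal-width bins, and the number of bins are assumptions made for illustration and are not prescribed by the procedures described herein.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a probe sequence (0.0 to 1.0)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def group_by_gc(probes, n_bins: int = 4):
    """Assign each probe to one of n_bins equal-width GC-content bins.

    Each bin stands for one hypothetical substrate that would be
    hybridized under conditions suited to its GC range.
    """
    groups = {i: [] for i in range(n_bins)}
    for probe in probes:
        # A GC fraction of exactly 1.0 is placed in the last bin.
        index = min(int(gc_fraction(probe) * n_bins), n_bins - 1)
        groups[index].append(probe)
    return groups


if __name__ == "__main__":
    example = ["AAAAAAAA", "ATGCATGC", "GGGGCCCC"]
    for bin_index, members in group_by_gc(example).items():
        print(bin_index, members)
```

In practice the grouping criterion might be a measured or predicted hybrid stability rather than raw GC fraction; the bins here only show how data from several substrates could be kept separate before being combined.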

D. Detection: VLSIPS Scanning

The next step of the sequencing process by hybridization involves labeling of target polynucleotide molecules. A
quickly and easily detectable signal is preferred. The VLSIPS apparatus is designed to easily detect a fluorescent label,
so fluorescent tagging of the target sequence is preferred. Other suitable labels include heavy metal labels, magnetic
probes, chromogenic labels (e.g., phosphorescent labels, dyes, and fluorophores), spectroscopic labels, enzyme linked
labels, radioactive labels, and labeled binding proteins. Additional labels are described in U.S. Pat. No. 4,366,241,
which is incorporated herein by reference.

The detection methods used to determine where hybridization has taken place will typically depend upon the label
selected above. Thus, for a fluorescent label a fluorescent detection step will typically be used. PCT publication no.
WO90/15070 and U.S.S.N. 07/624,120 describe apparatus and mechanisms for scanning a substrate matrix using
fluorescence detection, but a similar apparatus is adaptable for other optically detectable labels.

The detection method provides a positional localization of the region where hybridization has taken place. However,
the position is correlated with the specific sequence of the probe since the probe has specifically been attached or
synthesized at a defined substrate matrix position. Having collected all of the data indicating the subsequences present
in the target sequence, this data may be aligned by overlap to reconstruct the entire sequence of the target as
illustrated above.

It is also possible to dispense with actual labeling if some means for detecting the positions of interaction between
the sequence specific reagent and the target molecule is available. This may take the form of an additional reagent
which can indicate the sites either of interaction or of lack of interaction, e.g., a negative label. For the nucleic
acid embodiments, locations of double strand interaction may be detected by the incorporation of intercalating dyes, or
other reagents such as antibodies or other reagents that recognize helix formation; see, e.g., Sheldon et al. (1986) U.S.
Pat. No. 4,582,789, which is hereby incorporated herein by reference.

E. Analysis

Although the reconstruction can be performed manually as illustrated above, a computer program will typically be
used to perform the overlap analysis. A program may be written and run on any of a large number of different computer
hardware systems. The variety of operating systems and languages usable will be recognized by a computer software
engineer. Various different languages may be used, e.g., BASIC, C, PASCAL, etc. A simple flow chart of data analysis
is illustrated in Figure 4.
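
By way of illustration only, the overlap analysis might be sketched as follows. This is not the flow chart of Figure 4; it is a simplified sketch that assumes an error-free set of detected subsequences of equal length in which every overlap of length k-1 occurs only once in the target. The function names `spectrum` and `reconstruct` are hypothetical.

```python
def spectrum(sequence: str, k: int):
    """Set of all length-k subsequences of a sequence (idealized probe hits)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct(kmers):
    """Rebuild a linear sequence from its k-mers by unique (k-1) overlaps.

    Raises ValueError when the overlaps are ambiguous, for example when
    a (k-1)-mer repeats within the target, or when no unique start exists.
    """
    kmers = set(kmers)
    if not kmers:
        raise ValueError("no subsequences supplied")
    k = len(next(iter(kmers)))

    # Index the k-mers by their leading k-1 bases.
    by_prefix = {}
    for kmer in kmers:
        by_prefix.setdefault(kmer[:k - 1], []).append(kmer)

    # The first k-mer is the one whose prefix is no other k-mer's suffix.
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:k - 1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot identify a unique starting subsequence")

    current = starts[0]
    sequence = current
    used = {current}
    while True:
        candidates = [m for m in by_prefix.get(current[1:], []) if m not in used]
        if not candidates:
            break
        if len(candidates) > 1:
            raise ValueError("ambiguous overlap; a longer probe length is needed")
        current = candidates[0]
        used.add(current)
        sequence += current[-1]  # each overlap extends the sequence by one base

    if used != kmers:
        raise ValueError("some subsequences could not be placed")
    return sequence


if __name__ == "__main__":
    hits = spectrum("ACGTTGCA", 4)
    print(reconstruct(hits))
```

Real hybridization data would include false positives and negatives and repeated subsequences, so a practical program would need additional steps that are not shown here.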

F. Substrate Reuse

Finally, after a particular sequence has been hybridized and the pattern of hybridization analyzed, the matrix
substrate should be reusable and readily prepared for exposure to a second or subsequent target polynucleotide. In
order to do so, the hybrid duplexes are disrupted and the matrix treated in a way which removes all traces of the original
target. The matrix may be treated with various detergents or solvents to which the substrate, the oligonucleotide probes,
and the linkages to the substrate are inert. This treatment may include an elevated temperature treatment, treatment
with organic or inorganic solvents, modifications in pH, and other means for disrupting specific interaction. Thereafter,
a second target may actually be applied to the recycled matrix and analyzed as before.

IV. FINGERPRINTING 

A. General 

Many of the procedures and techniques used in the polynucleotide sequencing section are also appropriate for
fingerprinting applications. See, e.g., Poustka et al. (1986) Cold Spring Harbor Symposia on Quant. Biol., vol. LI,
131-139, Cold Spring Harbor Press, New York; which is hereby incorporated herein by reference. The fingerprinting
method provided herein is based, in part, upon the ability to positionally localize a large number of different specific
probes onto a single substrate. This high density matrix pattern provides the ability to screen for, or detect, a very large
number of different sequences simultaneously. In fact, depending upon the hybridization conditions, fingerprinting to
the resolution of virtually absolute matching of sequence is possible, thereby approaching an absolute sequencing
embodiment. And the sequencing embodiment is very useful in identifying the probes useful in further fingerprinting
uses. For example, characteristic features of genetic sequences will be identified as being diagnostic of the entire
sequence. However, in most embodiments, longer probes and targets will be used, for which slight mismatching may
not need to be resolved.

B. Preparation of Substrate Matrix 

A collection of specific probes may be produced by either of the methods described above in the section on
sequencing. Specific oligonucleotide probes of desired lengths may be individually synthesized on a standard
oligonucleotide synthesizer. The length of these probes is limited only by the ability of the synthesizer to continue
to accurately synthesize a molecule. Oligonucleotides or sequence fragments may also be isolated from natural
sources. Biological amplification methods may be coupled with synthetic procedures such as, e.g.,
polymerase chain reaction.

In one embodiment, the individually isolated probes may be attached to the matrix at defined positions. These
probe reagents may be attached by an automated process making use of the caged biotin methodology described in
U.S.S.N. 07/612,671 (caged biotin CIP), or using photochemical reagents, see, e.g., Dattagupta et al. (1985) U.S. Pat.
No. 4,542,102 and (1987) U.S. Pat. No. 4,713,326. Each individual purified reagent can be attached individually at
specific locations on a substrate.

In another embodiment, the VLSIPS synthesizing technique may be used to synthesize the desired probes at
specific positions on a substrate. The probes may be synthesized by successively adding appropriate monomer subunits,
e.g., nucleotides, to generate the desired sequences.

In another embodiment, a relatively short specific oligonucleotide is used which serves as a targeting reagent for
positionally directing the sequence recognition reagent. For example, the sequence specific reagents having a separate
additional sequence recognition segment (usually of a different polymer from the target sequence) can be directed to
target oligonucleotides attached to the substrate. By use of non-natural targeting reagents, e.g., unusual nucleotide
analogues which pair with other unnatural nucleotide analogues and which do not interfere with natural nucleotide
interactions, the natural and non-natural portions can coexist on the same molecule without interfering with their
individual functionalities. This can combine both a synthetic and biological production system analogous to the
technique for targeting monoclonal antibodies to locations on a VLSIPS substrate at defined positions. Unnatural
optical isomers of nucleotides may be useful unnatural reagents subject to similar chemistry, but incapable of
interfering with the natural biological polymers. See also, U.S.S.N. 07/626,730, filed December 6, 1990; which is
hereby incorporated herein by reference.

After the separate substrate attached reagents are attached to the targeting segment, the two are crosslinked,
thereby permanently attaching them to the substrate. Suitable crosslinking reagents are known, see, e.g., Dattagupta
et al. (1985) U.S. Pat. No. 4,542,102 and (1987) "Coupling of nucleic acids to solid support by photochemical methods,"
U.S. Pat. No. 4,713,326, each of which is hereby incorporated herein by reference. Similar linkages for attachment of
proteins to a solid substrate are provided, e.g., in Merrifield (1986) Science 232:341, which is hereby incorporated
herein by reference.

C. Labeling Target Nucleotides

The labeling procedures used in the sequencing embodiments will also be applicable in the fingerprinting
embodiments. However, since the fingerprinting embodiments often will involve relatively large target molecules and
relatively short oligonucleotide probes, the amount of signal necessary to incorporate into the target sequence may be
less critical than in the sequencing applications. For example, a relatively long target with a relatively small number of
labels per molecule may be easily amplified or detected because of the relatively large target molecule size.

In various embodiments, it may be desired to cleave the target into smaller segments as in the sequencing embod- 
iments. The labeling procedures and cleavage techniques described in the sequencing embodiments would usually 
also be applicable here. 

D. Hybridization Conditions 

The hybridization conditions used in fingerprinting embodiments will typically be less critical than for the
sequencing embodiments. The reason is that the amount of mismatching which may be useful in providing the
fingerprinting information would typically be far greater than that necessary in sequencing uses. For example, Southern
hybridizations do not typically distinguish between slightly mismatched sequences. Under these circumstances,
valuable fingerprinting information may be obtained with less stringent hybridization conditions. However, since the
entire substrate is typically exposed to the target molecule at one time, the binding affinity of the probes should usually
be of approximately comparable levels. For this reason, if oligonucleotide probes are being used, their lengths should
be approximately comparable and will be selected to hybridize under conditions which are common for most of the
probes on the substrate. Much as in a Southern hybridization, the target and oligonucleotide probes are of lengths
typically greater than about 25 nucleotides. Under appropriate hybridization conditions, e.g., typically higher salt and
lower temperature, the probes will hybridize irrespective of imperfect complementarity. In fact, with probes of greater
than, e.g., about fifty nucleotides, the difference in stability of different sized probes will be relatively minor.

Typically the fingerprinting is merely for probing similarity or homology. Thus, the stringency of hybridization can
usually be decreased to fairly low levels. See, e.g., Wetmur and Davidson (1968) "Kinetics of Renaturation of DNA," J.
Mol. Biol., 31:349-370; and Kanehisa, M. (1984) Nuc. Acids Res., 12:203-213.

E. Detection: VLSIPS Scanning

Detection methods will be selected which are appropriate for the selected label. The scanning device need not
necessarily be digitized or placed into a specific digital database, though such would most likely be done. For example,
the analysis in fingerprinting could be photographic. Where a standardized fingerprint substrate matrix is used, the
pattern of hybridizations may be spatially unique and may be compared photographically. In this manner, each sample
may have a characteristic pattern of interactions and the likelihood of identical patterns will preferably be of such low
frequency that the fingerprint pattern indeed becomes a characteristic pattern virtually as unique as an individual's
fingertip fingerprint. With a standardized substrate, every individual could be, in theory, uniquely identifiable on the basis
of the pattern of hybridizing to the substrate.

Of course, the VLSIPS scanning apparatus may also be useful to generate a digitized version of the fingerprint
pattern. In this way, the identification pattern can be provided in a linear string of digits. This sequence could also be
used for a standardized identification system providing significant useful medical transferability of specific data. In one
embodiment, the probes used are selected to be of sufficiently high resolution to measure polynucleotides encoding
antigens of the major histocompatibility complex; it might even be possible to provide transplantation matching data in
a linear stream of data. The fingerprinting data may provide a condensed version, or summary, of the linear genetic
data, or any other information data base.
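
The conversion of a scanned pattern into a linear string of digits may be pictured as follows. This is a minimal sketch under stated assumptions: the scan is taken to be a rectangular grid of intensity values read row by row, and a single fixed `threshold` separates hybridized from non-hybridized positions. The function name `digitize` and the threshold value are hypothetical.

```python
def digitize(intensities, threshold: float) -> str:
    """Flatten a 2-D grid of scan intensities into a string of 0s and 1s.

    Positions are read row by row; a position at or above the threshold
    is recorded as "1" (hybridized), otherwise as "0".
    """
    return "".join(
        "1" if value >= threshold else "0"
        for row in intensities
        for value in row
    )


if __name__ == "__main__":
    scan = [
        [0.9, 0.1],
        [0.2, 0.8],
    ]
    print(digitize(scan, threshold=0.5))
```

Because the probe layout on a standardized substrate is fixed, two such strings can be compared position by position without reference to the original image.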

F. Analysis 

The analysis of the fingerprint will often be much simpler than a total sequence determination. However, there may
be particular types of analysis which will be substantially simplified by a selected group of probes. For example, probes
which exhibit particular populational heterogeneity may be selected. In this way, analysis may be simplified and
practical utility enhanced merely by careful selection of the specific probes and a careful matrix layout of those probes.

As with the sequencing application, the fingerprinting usages may also take advantage of the reusability of the
substrate. In this way, the interactions can be disrupted, the substrate treated, and the renewed substrate is equivalent
to an unused substrate.

H. Other Polynucleotide Aspects 

Besides using the fingerprinting method for analyzing the structure of a particular polynucleotide, the fingerprinting
method may be used to characterize various samples. For example, a cell or population of cells may be tested for their
expression of particular mRNA sequences, or for patterns of expressed mRNA species. This may be applicable to a cell
or tissue type, to the messenger RNA population expressed by a cell, or to the genetic content of a cell.

RNA can be isolated from a cell or a cell population, such as a purified cell fraction or a biopsy sample. The RNA
may be labeled, for example by attaching a fluorescent molecule to isolated RNA or by using radiolabeled RNA (e.g.,
end-labeled with T4 polynucleotide kinase). A VLSIPS substrate containing positionally discrete oligonucleotide
sequences may then be exposed to the pool of labeled RNA species under conditions permitting specific hybridization.
The pattern of positions at which labeled RNA has formed specific hybrids may be compared to a reference pattern to
identify, and in some embodiments quantify, the expressed RNA species, or to identify the hybridization pattern itself as
being characteristic of a particular cell type.

For example but not for limitation, a VLSIPS oligonucleotide substrate may be hybridized to a labeled RNA sample
obtained from a first cell type (e.g., human lymphocytes) to establish a reference hybridization pattern for the first cell
type. Similarly, an identical VLSIPS oligonucleotide substrate may be hybridized to a labeled RNA sample obtained
from a second cell type (e.g., human monocytes) to establish a reference hybridization pattern for the second cell type.
Labeled RNA may then be prepared from a cell or a cell population and hybridized to an identical VLSIPS
oligonucleotide substrate, and the resultant hybridization pattern can be compared to the reference hybridization
patterns established for the first and second cell types. By such comparisons, the RNA expression pattern of a cell or
cell population can be identified as being similar to or distinct from one or more reference hybridization patterns.
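
One simple way to compare a sample pattern against reference patterns is to score the fraction of substrate positions at which the two patterns agree. The sketch below assumes that each pattern has already been reduced to a string of 0s and 1s of equal length, one character per substrate position; the function names `similarity` and `closest_reference` and the example patterns are hypothetical, and other similarity measures could equally be used.

```python
def similarity(pattern_a: str, pattern_b: str) -> float:
    """Fraction of positions at which two equal-length patterns agree."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must come from identical substrates")
    if not pattern_a:
        raise ValueError("empty pattern")
    matches = sum(a == b for a, b in zip(pattern_a, pattern_b))
    return matches / len(pattern_a)


def closest_reference(sample: str, references: dict) -> str:
    """Name of the reference pattern most similar to the sample."""
    return max(references, key=lambda name: similarity(sample, references[name]))


if __name__ == "__main__":
    # Made-up reference patterns for two cell types.
    references = {"lymphocyte": "110010", "monocyte": "001101"}
    sample = "110011"
    for name, pattern in references.items():
        print(name, similarity(sample, pattern))
    print("closest:", closest_reference(sample, references))
```

A threshold on the best score could be used to report a sample as distinct from all reference patterns rather than forcing an assignment.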

Where a positionally discrete oligonucleotide on the VLSIPS substrate is in molar excess over the amount of the
cognate (complementary) labeled RNA species in the hybridization reaction, the amount of specific hybridization to that
VLSIPS locus (as measured by labeling intensity at that locus) can provide a quantitative measurement of the cognate
RNA species present in the labeled RNA sample. Thus, hybridization of labeled RNA to a VLSIPS oligonucleotide
substrate can provide information identifying the individual RNA species that are expressed in a particular cell or cell
population, as well as the relative abundance of one or more individual RNA species. This information can serve to
fingerprint specific cell types or particular stages in cell differentiation.
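
As an illustration of how labeling intensities might be turned into relative abundances, the following sketch subtracts a background level from each locus and normalizes the results to sum to one. It assumes the probe at each locus is in molar excess, as stated above, so that intensity scales with the amount of cognate RNA. The function name `relative_abundance`, the locus names, and the uniform `background` correction are assumptions made for illustration.

```python
def relative_abundance(intensities: dict, background: float = 0.0) -> dict:
    """Normalize per-locus intensities to fractions of the total signal.

    Intensities at or below the background level are treated as zero.
    """
    corrected = {
        locus: max(value - background, 0.0)
        for locus, value in intensities.items()
    }
    total = sum(corrected.values())
    if total == 0:
        return {locus: 0.0 for locus in corrected}
    return {locus: value / total for locus, value in corrected.items()}


if __name__ == "__main__":
    # Made-up intensities for two hypothetical loci.
    measured = {"locus_A": 30.0, "locus_B": 10.0}
    print(relative_abundance(measured))
```

Absolute quantitation would additionally require calibration against known amounts of labeled RNA, which is outside the scope of this sketch.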

For example but not for limitation, RNA samples prepared from tissue biopsies, specifically tumor biopsies, can be
labeled and hybridized to a VLSIPS oligonucleotide substrate, and the resultant hybridization pattern can provide
information regarding cell type, degree of differentiation, and metastatic potential (malignancy). Some of the positionally
distinct oligonucleotides may hybridize specifically with RNA species transcribed from endogenous proto-oncogenes
(e.g., c-myc, c-rasH, c-sis, etc.) which are, in certain instances, transcribed at elevated levels in neoplastic tissues.

In addition to diagnostic applications, labeled RNA samples from various neoplastic cell types may be hybridized
to VLSIPS oligonucleotide substrates and the resultant hybridization pattern(s) compared to reference patterns
obtained with RNA from related, non-neoplastic cell types. Identification of distinctions between the hybridization
patterns obtained with RNA from neoplastic cells as compared to patterns obtained from RNA from non-neoplastic cells
may be of diagnostic value and may identify RNA species that encode proteins that are potential targets for novel
therapeutic modalities. In fact, the high resolution of the test will allow more complete characterization of parameters
which define particular diseases. Thus, the power of diagnostic tests may be limited by the extent of statistical
correlation with a particular condition rather than by the number of RNA species which are tested. The present
invention provides the means to generate this large universe of possible reagents and the ability to actually accumulate
that correlative data.

For fingerprinting of RNA expression patterns, the VLSIPS substrate polynucleotides will be at least 12 nucleotides
in length, preferably at least 15 nucleotides in length, more preferably at least 25 nucleotides in length. The sequences
of the positionally distinct polynucleotides on the VLSIPS substrate may be selected from published sources of
sequence data, including but not limited to computerized databases such as GenBank, and may or may not include
random or pseudorandom sequences for detecting RNA species which have not yet been identified in the art.
Fingerprint analysis of RNA expression patterns will typically employ high-stringency washes so as to provide
hybridization patterns that reflect predominantly specific hybridization. However, some non-specific hybridization
and/or cross-hybridization to slightly mismatched sequences may be tolerated, and in some embodiments may be
desirable.

The ability to generate a high density means for screening the presence or absence of specific interactions allows
for the possibility of screening for, if not saturating, all of a very large number of possible interactions. This is very
powerful in providing the means for testing the combinations of molecular properties which can define a class of
samples. For example, a species of organism may be characterized by its DNA sequences, e.g., a genetic fingerprint.
By using a fingerprinting method, it may be determined that all members of that species are sufficiently similar in
specific sequences that they can be easily identified as being within a particular group. Thus, newly defined classes
may be resolved by their similarity in fingerprint patterns. Alternatively, a non-member of that group will fail to share
those many identifying characteristics. However, since the technology allows testing of a very large number of specific
interactions, it also provides the ability to more finely distinguish between closely related different cells or samples. This
will have important applications in diagnosing viral, bacterial, and other pathological or nonpathological infections.

In particular, cell classification may be defined by any of a number of different properties. For example, a cell class
may be defined by the DNA sequences contained therein. This allows species identification for parasitic or other
infections. For example, the human cell is presumably genetically distinguishable from a monkey cell, but different
human cells will share many genetic markers. At higher resolution, each individual human genome will exhibit unique
sequences that can define it as a single individual.

Likewise, a developmental stage of a cell type may be definable by its pattern of expression of messenger RNA.
For example, in particular stages of cells, high levels of ribosomal RNA are found whereas relatively low levels of other
types of messenger RNAs may be found. The high resolution distinguishability provided by this fingerprinting method
allows the distinction between cells which have relatively minor differences in their expressed mRNA populations.
Where a pattern is shown to be characteristic of a stage, a stage may be defined by that particular pattern of messenger
RNA expression.

In another embodiment, a substrate as provided herein may be used for genetic screening. This would allow for
simultaneous screening of thousands of genetic markers. As the density of the matrix is increased, many more
molecules can be simultaneously tested. Genetic screening then becomes a simpler method as the present invention
provides the ability to screen for thousands, tens of thousands, and hundreds of thousands, even millions of different
possible genetic features. However, the number of high correlation genetic markers for conditions numbers only in the
hundreds. Again, the possibility for screening a large number of sequences provides the opportunity for generating the
data which can provide correlation between sequences and specific conditions or susceptibility. The present invention
provides the means to generate extremely valuable correlations useful for the genetic detection of the causative
mutation leading to medical conditions. In still another embodiment, the present invention would be applicable to
distinguishing two individuals having identical genetic compositions. The antibody population within an individual is
dependent both on genetic and historical factors. Each individual experiences a unique exposure to various infectious
agents, and the combined antibody expression is partly determined thereby. Thus, individuals may also be fingerprinted
by their lymphocyte DNA or RNA hybridization pattern(s). Similar sorts of immunological and environmental histories
may be useful for fingerprinting, perhaps in combination with other screening properties.

With the definition of new classes of cells, a cell sorter will be used to purify them. Moreover, new markers for
defining that class of cells will be identified. For example, where the class is defined by its RNA content, cells may be
screened by antisense probes which detect the presence or absence of specific sequences therein. Alternatively, cell
lysates may provide information useful in correlating intracellular properties with extracellular markers which indicate
functional differences. Using standard cell sorter technology with a fluorescent or labeled antisense probe which
recognizes the internal presence of the specific sequences of interest, the cell sorter will be able to isolate a relatively
homogeneous population of cells possessing the particular marker. Using successive probes, the sorting process
should be able to select for cells having a combination of a large number of different markers.

A problem with the fingerprinting method as an identification means arises from mosaicism in an organism. A mosaic
organism is one whose genetic content in different cells is significantly different. Various clonal populations should have
similar genetic fingerprints, though different clonal populations may have different genetic contents. See, for example,
Suzuki et al., An Introduction to Genetic Analysis (4th Ed.), Freeman and Co., New York, which is hereby incorporated
herein by reference. However, this problem should be relatively rare and could be more carefully evaluated
with greater experience using the fingerprinting methods.

The invention will also find use in detecting changes, both genetic and in protein expression (i.e., by RNA
expression fingerprinting), in a rapidly "evolving" protozoan infection, or similarly changing organism.

V. MAPPING 
A. General 

The use of the present invention for mapping parallels its use for fingerprinting and sequencing. Mapping provides
the ability to locate particular segments along the length of the polynucleotide. The mapping provides the ability to
locate, in a relative sense, the order of various subsequences. This may be achieved using at least two different
approaches.

The first approach is to take the large sequence and fragment it at specific points. The fragments are then ordered
and attached to a solid substrate. For example, the clones resulting from a chromosome walking process may be
individually attached to the substrate by methods, e.g., caged biotin techniques, indicated earlier. Segments of unknown
map position will be exposed to the substrate and will hybridize to the segment which contains that particular sequence.
This procedure allows the rapid mapping of a number of different labeled segments, each mapping requiring only
a single hybridization step once the substrate is generated. The substrate may be regenerated by removal of the
interaction, and the next mapping segment applied.
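
By way of illustration only, the lookup implied by this first approach may be sketched as follows. The sketch assumes
that exact substring matching on a single strand stands in for hybridization, and the clone sequences and the name
map_segment are hypothetical.

```python
from typing import List


def map_segment(segment: str, ordered_clones: List[str]) -> List[int]:
    """Return the indices of the ordered clones that contain the segment.

    Exact substring matching is a simplification standing in for hybridization.
    """
    return [i for i, clone in enumerate(ordered_clones) if segment in clone]


# Hypothetical clones from a chromosome walk, listed in map order.
clones = ["ACGTTGCA", "TGCAGGCT", "GGCTTAAC"]

print(map_segment("GCAGG", clones))  # [1]
print(map_segment("TGCA", clones))   # [0, 1]
```

A segment reported at two neighboring indices lies in the overlap between adjacent clones.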

In an alternative method, a plurality of subsequences can be attached to a substrate. Various short probes may be
applied to determine which segments may contain particular overlaps. The theoretical basis and a description of this
mapping procedure are contained in, e.g., Evans et al. (1989) "Physical Mapping of Complex Genomes by Cosmid Multiplex
Analysis," Proc. Natl. Acad. Sci. USA 86:5030-5034, and other references cited above in the Section labeled
"Overall Description." Using this approach, the details of the mapping embodiment are very similar to those used in the
fingerprinting embodiment.


B. Preparation of Substrate Matrix 

The substrate may be generated by either of the methods generally applicable in the sequencing and fingerprinting
embodiments. The substrate may be made either synthetically, or by attaching otherwise purified probes or sequences
to the matrix. The probes or sequences may be derived either from synthetic or biological means. As indicated above,
the solid phase substrate synthetic methods may be utilized to generate a matrix with positionally defined sequences.
In the mapping embodiment, saturation of all possible subsequences of a preselected length is far
less important than in the sequencing embodiment, but it may be desirable for the probes used to be much
longer. The processes for making a substrate which has longer oligonucleotide probes should not be significantly
different from those described for the sequencing embodiments, but the optimization parameters may be modified to
comply with the mapping needs.

C. Labeling 

The labeling methods will be similar to those applicable in sequencing and fingerprinting embodiments. Again, it
may be desirable to fragment the target sequences.

D. Hybridization/Specific Interaction

The specificity of interaction between the targets and probe would typically be closer to that used for fingerprinting
embodiments, where homology is more important than absolute distinguishability of high fidelity complementary
hybridization. Usually, the hybridization conditions will be such that merely homologous segments will interact and provide a
positive signal. Much like the fingerprinting embodiment, it may be useful to measure the extent of homology by
successive incubations at higher stringency conditions. Or, a plurality of different probes, each having various levels of
homology, may be used. Either way, the spectrum of homologies can be measured.
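
The successive-stringency measurement may be pictured with the following minimal sketch. It assumes, purely for
illustration, that stringency can be modeled as a minimum fraction of matching positions, and that the target is
written as the strand directly comparable to the probe. The names identity and highest_stringency_passed and the
threshold values are hypothetical.

```python
from typing import List, Optional


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def highest_stringency_passed(probe: str, target: str,
                              thresholds: List[float]) -> Optional[float]:
    """Return the highest threshold at which the pair still gives a signal.

    Each threshold models one incubation; a signal persists while the
    identity of the pair meets or exceeds the threshold.
    """
    score = identity(probe, target)
    passed = [t for t in sorted(thresholds) if score >= t]
    return passed[-1] if passed else None


# One mismatch in ten positions: identity is 0.9.
print(highest_stringency_passed("ACGTACGTAC", "ACGTACGTTC", [0.6, 0.8, 1.0]))  # 0.8
```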

E. Detection

The detection methods used in the mapping procedure will be virtually identical to those used in the fingerprinting
embodiment. The detection methods will be selected in combination with the labeling methods.

F. Analysis 

The analysis of the data in a mapping embodiment will typically be somewhat different from that in fingerprinting.
The fingerprinting embodiment will test for the presence or absence of specific or homologous segments. However, in
the mapping embodiment, the existence of an interaction is coupled with some indication of the location of the
interaction. The interaction is mapped in some manner to the physical polymer sequence. Some means for determining the
relative positions of different probes is provided. This may be achieved by synthesis of the substrate in a pattern, or may
result from analysis of sequences after they have been attached to the substrate.

For example, the probes may be randomly positioned at various locations on the substrate. However, the relative
positions of the various reagents in the original polymer may be determined by using short fragments, e.g., individually,
as target molecules which determine the proximity of different probes. By an automated system of testing each different
short fragment of the original polymer, coupled with proper analysis, it will be possible to determine which probes are
adjacent to one another on the original target sequence and correlate that with positions on the matrix. In this way, the
matrix is useful for determining the relative locations of various new segments in the original target molecule. This sort
of analysis is described in Evans, and the related references described above.
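
The adjacency analysis may be sketched as follows. This is not the procedure of Evans et al.; it is a minimal
illustration which assumes that each short fragment hybridizes to at most two probes, that two probes hit by the same
fragment are neighbors, and that the probes form a single linear chain. The name order_probes and the probe labels are
hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def order_probes(fragment_hits: List[Set[str]]) -> List[str]:
    """Derive a linear probe order from fragments that bridge two probes."""
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for hits in fragment_hits:
        if len(hits) == 2:
            a, b = tuple(hits)
            neighbors[a].add(b)
            neighbors[b].add(a)
    if any(len(n) > 2 for n in neighbors.values()):
        raise ValueError("a probe has more than two neighbors")
    ends = sorted(p for p, n in neighbors.items() if len(n) == 1)
    if len(ends) != 2:
        raise ValueError("probes do not form a single linear chain")
    # Walk from one end of the chain to the other; orientation is arbitrary.
    order = [ends[0]]
    prev = None
    while True:
        nxt = [p for p in neighbors[order[-1]] if p != prev]
        if not nxt:
            break
        prev = order[-1]
        order.append(nxt[0])
    return order


hits = [{"P3", "P1"}, {"P1", "P4"}, {"P4", "P2"}]
print(order_probes(hits))  # ['P2', 'P4', 'P1', 'P3']
```

The returned order can then be correlated with the known matrix position of each probe.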

In another form of mapping, as described above in the fingerprinting section, the developmental map of a cell or
biological system may be measured using fingerprinting type technology. Thus, the mapping may be along a temporal
dimension rather than along a polymer dimension. The mapping or fingerprinting embodiments may also be used in
determining the genetic rearrangements which may be genetically important, as in lymphocyte and B-cell development.
In another example, various rearrangements or chromosomal dislocations may be tested by either the fingerprinting or
mapping methods. These techniques are similar, and the fingerprinting and mapping embodiments
may overlap in many respects.






G. Substrate Reuse 

The substrate should be reusable in the manner described in the fingerprinting section. The substrate is renewed
by removal of the specific interactions and is washed and prepared for successive cycles of exposure to new target
sequences.

VI. ADDITIONAL SCREENING AND APPLICATIONS 

A. Specific Interactions 


As originally indicated in the parent filing of VLSIPS, the production of a high density plurality of spatially
segregated polymers provides the ability to generate a very large universe or repertoire of individual and distinct sequence
possibilities. As indicated above, particular oligonucleotides may be synthesized in automated fashion at specific
locations on a matrix. In fact, these oligonucleotides may be used to direct other molecules to specific locations by linking
specific oligonucleotides to other reagents which are in batch exposed to the matrix and hybridized in a complementary
fashion to only those locations where the complementary oligonucleotide has been synthesized on the matrix. This
allows for spatially attaching a plurality of different reagents onto the matrix instead of individually attaching each
separate reagent at each specific location. Although the caged biotin method allows automated attachment, the
caged biotin attachment process is relatively slow and requires a separate reaction for each reagent being
attached. By use of the oligonucleotide method, the positional targeting can be done in an automated and parallel
fashion. As each reagent is produced, instead of directly attaching each reagent at each desired position, the reagent
may be attached to a specific desired complementary oligonucleotide which will ultimately be specifically directed
toward locations on the matrix having a complementary oligonucleotide attached thereat.
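
The oligonucleotide-directed placement may be illustrated with the following sketch, which assumes perfect
Watson-Crick complementarity and sequences written 5' to 3'. The matrix contents, the reagent names, and the function
names are hypothetical.

```python
from typing import Dict, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def direct_reagents(matrix: Dict[Tuple[int, int], str],
                    tagged_reagents: Dict[str, str]) -> Dict[Tuple[int, int], str]:
    """Assign each tagged reagent to the matrix position whose oligonucleotide
    is the reverse complement of the reagent's tag."""
    by_target = {reverse_complement(tag): name for name, tag in tagged_reagents.items()}
    return {pos: by_target[oligo] for pos, oligo in matrix.items() if oligo in by_target}


# Oligonucleotides synthesized at two matrix positions (hypothetical).
matrix = {(0, 0): "ACGTTG", (0, 1): "TTAGGC"}
# Reagents linked to complementary tags and applied in a single batch.
reagents = {"antibody_A": "CAACGT", "enzyme_B": "GCCTAA"}

print(direct_reagents(matrix, reagents))
# {(0, 0): 'antibody_A', (0, 1): 'enzyme_B'}
```

All reagents are placed in one batch exposure because each tag recognizes only its own position.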

In addition, the technology allows screening for specificity of interaction with particular reagents. For example, the
oligonucleotide sequence specificity of binding of a potential reagent may be tested by presenting to the reagent all of
the possible subsequences available for binding. Although secondary or higher order sequence specific features might
not be easily screenable using this technology, it does provide a convenient, simple, quick, and thorough screen of
interactions between a reagent and its target recognition sequences. See, e.g., Pfeifer et al. (1989) Science 246:810-812.
For example, the interaction of a promoter protein with its target binding sequence may be tested for many different,
or all, possible binding sequences. By testing the strength of interactions under various different conditions, the
interaction of the promoter protein with each of the different potential binding sites may be analyzed. The spectrum of strength
of interactions with each different potential binding site may provide significant insight into the types of features which
are important in determining specificity.
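
To give a sense of scale for presenting "all possible" binding sequences, the following sketch enumerates every
subsequence of a chosen length over the four natural bases and ranks sites by a measured signal. The function names and
the signal values are hypothetical and are not data from this specification.

```python
from itertools import product
from typing import Dict, List, Tuple


def all_subsequences(k: int, alphabet: str = "ACGT") -> List[str]:
    """Enumerate every sequence of length k; there are len(alphabet) ** k of them."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def rank_sites(signal: Dict[str, float], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` sequences with the strongest measured interaction."""
    return sorted(signal.items(), key=lambda item: item[1], reverse=True)[:top]


print(len(all_subsequences(3)))  # 64
print(len(all_subsequences(8)))  # 65536

# Hypothetical interaction strengths for a few candidate sites.
measured = {"TATAAA": 0.92, "TATATA": 0.61, "GCGCGC": 0.05, "TTTAAA": 0.33}
print(rank_sites(measured, top=2))  # [('TATAAA', 0.92), ('TATATA', 0.61)]
```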

An additional example of a sequence specific interaction between reagents is the testing of binding of a double
stranded nucleic acid structure with a single stranded oligonucleotide. Often, a triple stranded structure is produced
which has significant aspects of sequence specificity. Testing of such interactions with either sequences comprising
only natural nucleotides, or perhaps the testing of nucleotide analogs, may be very important in screening for particularly
useful diagnostic or therapeutic reagents. See, e.g., Haner and Dervan (1990) Biochemistry 29:9761-9765, and
references therein.


B. Sequence Comparisons 

Once a gene is sequenced, the present invention provides means to compare alleles or related sequences to
locate and identify differences from the control sequence. This would be extremely useful in further analysis of genetic
variability at a specific gene locus.

C. Categorizations 

As indicated above in the fingerprinting and mapping embodiments, the present invention is also useful to define
specific stages in the temporal sequence of cells, e.g., development, and the resulting tissues within an organism. For
example, the developmental stage of a cell, or population of cells, can be dependent upon the expression of particular
messenger RNAs. The screening procedures provided allow for high resolution definition of new classes of cells. In
addition, the temporal development of particular cells will be characterized by the presence or expression of various
mRNAs. Means to simultaneously screen a plurality or very large number of different sequences are provided. The
combination of different markers made available dramatically increases the ability to distinguish fairly closely related cell
types. Other markers may be combined with markers and methods made available herein to define new classifications
of biological samples, e.g., based upon new combinations of markers.

The presence or absence of particular marker sequences will be used to define temporal developmental stages.
Once the stages are defined, fairly simple methods can be applied to actually purify those particular cells. For example,
antisense probes or recognition reagents may be used with a cell sorter to select those cells containing or expressing
the critical markers. Alternatively, the expression of those sequences may result in specific antigens which may also be
used in defining cell classes and sorting those cells away from others. In this way, for example, it should be possible to
select a class of omnipotent immune system cells which are able to completely regenerate a human immune system.
Based upon the cellular classes defined by the parameters made available by this technology, purified classes of cells
having identifiable differences in RNA expression and/or DNA structure are made available.

In an alternative embodiment, subclasses of T-cells are defined, in part, upon the combination of expressed cell
surface RNA species. The present invention allows for the simultaneous screening of a large plurality of different RNA
species together. Thus, higher resolution classification of different T-cell subclasses becomes possible and, with the
definitions and functional differences which correlate with those other parameters, the ability to purify those cell types
becomes available. This is applicable not only to T-cells and lymphocyte cells, or even to freely circulating cells. Many of the
cells for which this would be most useful will be immobile cells found in particular tissues or organs. Tumor cells will be
diagnosed or detected using these fingerprinting techniques. Coupled with a temporal change in structure,
developmental classes may also be selected and defined using these technologies. The present invention also provides the
ability not only to define new classes of cells based upon functional or structural differences, but also to
select or purify populations of cells which share these particular properties. In particular, antisense DNA or RNA
molecules may be introduced into a cell to detect RNA sequences therein. See, e.g., Weintraub (1990) Scientific
American 262:40-46.

D. Statistical Correlations

In an additional embodiment, the present invention also allows for the high resolution correlation of medical
conditions with various different markers. For example, the present technology, when applied to amniocentesis or other
genetic screening methods, typically screens for tens of different markers at most. The present invention allows
simultaneous screening for tens, hundreds, thousands, tens of thousands, hundreds of thousands, and even millions of
different genetic sequences. Thus, applying the fingerprinting methods of the present invention to a sufficiently large
population allows detailed statistical analysis to be made, thereby correlating particular medical conditions with
particular markers, typically genetic markers or pathognomonic RNA expression patterns. Tumor-specific RNA expression
patterns and particular RNA species characterizing various neoplastic phenotypes will be identified using the present
invention.
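
One simple form such a statistical analysis could take is an odds ratio computed per marker from a population screen.
The sketch below is offered only as an illustration; the choice of the odds ratio, the function names, and the counts
are assumptions and are not taken from this specification.

```python
from typing import Dict, List, Tuple

# Counts per marker: (marker and condition, marker only, condition only, neither).
Counts = Tuple[int, int, int, int]


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds of the condition with the marker divided by the odds without it."""
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined when b or c is zero")
    return (a * d) / (b * c)


def rank_markers(tables: Dict[str, Counts]) -> List[Tuple[str, float]]:
    """Rank markers by odds ratio, strongest association first."""
    scored = [(marker, odds_ratio(*counts)) for marker, counts in tables.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical screening results for two markers in 100 individuals each.
population = {"marker_1": (30, 10, 20, 40), "marker_2": (25, 25, 25, 25)}
print(rank_markers(population))  # [('marker_1', 6.0), ('marker_2', 1.0)]
```

With many thousands of markers screened at once, a correction for multiple comparisons would also be needed before
drawing conclusions.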

Various medical conditions may be correlated against an enormous data base of the sequences within an
individual. Genetic propensities and correlations then become available and high resolution genetic predictability and
correlation become much more easily performed. With the enormous data base, the reliability of the predictions also is better
tested. Particular markers which are partially diagnostic of particular medical conditions or medical susceptibilities will
be identified and provide direction in further studies and more careful analysis of the markers involved. Of course, as
indicated above in the sequencing embodiment, the present invention will find much use in intense sequencing projects.
For example, sequencing of the entire human genome in the human genome project will be greatly simplified and
enabled by the present invention.


VII. FORMATION OF SUBSTRATE

The substrate is provided with a pattern of specific reagents which are positionally localized on the surface of the
substrate. This matrix of positions is defined by the automated system which produces the substrate. The instrument
will typically be one similar to that described in PCT publication no. WO90/15070, and U.S.S.N. 07/624,120. The
instrumentation described therein is directly applicable to the applications used here. In particular, the apparatus comprises
a substrate, typically a silicon containing substrate, on which positions on the surface may be defined by a coordinate
system of positions. These positions can be individually addressed or detected by the VLSIPS apparatus.

Typically, the VLSIPS apparatus uses optical methods used in semiconductor fabrication applications. In this way,
masks may be used to photo-activate positions for attachment or synthesis of specific sequences on the substrate.
These manipulations may be automated by the types of apparatus described in PCT publication no. WO90/15070 and
U.S.S.N. 07/624,120.

Selectively removable protecting groups allow creation of well defined areas of substrate surface having differing 
reactivities. Preferably, the protecting groups are selectively removed from the surface by applying a specific activator, 
such as electromagnetic radiation of a specific wavelength and intensity. More preferably, the specific activator exposes
selected areas of surface to remove the protecting groups in the exposed areas. 

Protecting groups of the present invention are used in conjunction with solid phase oligonucleotide syntheses using
deoxyribonucleic and ribonucleic acids. In addition to protecting the substrate surface from unwanted reaction, the
protecting groups block a reactive end of the monomer to prevent self-polymerization.

Attachment of a protecting group to the 5'-hydroxyl group of a nucleoside during synthesis using, for example,
phosphate-triester coupling chemistry, prevents the 5'-hydroxyl of one nucleoside from reacting with the 3'-activated
phosphate-triester of another.

Regardless of the specific use, protecting groups are employed to protect a moiety on a molecule from reacting with
another reagent. Protecting groups of the present invention have the following characteristics: they prevent selected
reagents from modifying the group to which they are attached; they are stable (that is, they remain attached) to the
synthesis reaction conditions; they are removable under conditions that do not adversely affect the remaining structure; and
once removed, do not react appreciably with the surface or surface-bound oligonucleotide.

In a preferred embodiment, the protecting groups will be photoactivatable. The properties and uses of photoreactive
protecting compounds have been reviewed. See McCray et al., Ann. Rev. of Biophys. and Biophys. Chem. (1989)
18:239-270, which is incorporated herein by reference. Preferably, the photosensitive protecting groups will be
removable by radiation in the ultraviolet (UV) or visible portion of the electromagnetic spectrum. More preferably, the
protecting groups will be removable by radiation in the near UV or visible portion of the spectrum. In some embodiments,
however, activation may be performed by other methods such as localized heating, electron beam lithography, laser
pumping, oxidation or reduction with microelectrodes, and the like. Sulfonyl compounds are suitable reactive groups for
electron beam lithography. Oxidative or reductive removal is accomplished by exposure of the protecting group to an
electric current source, preferably using microelectrodes directed to the predefined regions of the surface which are
desired for activation. A more detailed description of these protective groups is provided in U.S.S.N. 07/624,120, which
is hereby incorporated herein by reference.

The density of reagents attached to a silicon substrate may be varied by standard procedures. The surface area for
attachment of reagents may be increased by modifying the silicon surface. For example, a matte surface may be
machined or etched on the substrate to provide more sites for attachment of the particular reagents. Another way to
increase the density of reagent binding sites is to increase the derivatization density of the silicon. Standard procedures
for achieving this are described below.

One method to control the derivatization density is to highly derivatize the substrate with photochemical groups at
high density. The substrate is then photolyzed for various predetermined times, which photoactivates the groups at a
measurable rate, and is then reacted with a capping reagent. By this method, the density of linker groups may be modulated
by using a desired time and intensity of photoactivation.
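
The dependence of linker density on photolysis time may be sketched under an assumed first-order model, in which
photoactivated groups are capped and only the groups that remain protected survive as linker sites. The first-order
form, the rate constant, and the function names are assumptions made for illustration; the specification states only
that the rate is measurable.

```python
import math


def remaining_density(initial_density: float, rate_constant: float, seconds: float) -> float:
    """Density of uncapped linker groups after photolysis and capping.

    Assumes first-order photoactivation; rate_constant (1/s) is taken to
    scale with the light intensity and must be measured for a given setup.
    """
    return initial_density * math.exp(-rate_constant * seconds)


def exposure_time_for(target_fraction: float, rate_constant: float) -> float:
    """Photolysis time (s) that leaves the given fraction of groups uncapped."""
    if not 0.0 < target_fraction <= 1.0:
        raise ValueError("target_fraction must be in (0, 1]")
    return -math.log(target_fraction) / rate_constant


# Hypothetical rate constant of 0.01 per second; aim for 25% of the initial density.
t = exposure_time_for(0.25, 0.01)
print(remaining_density(100.0, 0.01, t))
```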

In many applications, the number of different sequences which may be provided may be limited by the density and
the size of the substrate on which the matrix pattern is generated. In situations where the density is insufficiently high
to allow the screening of the desired number of sequences, multiple substrates may be used to increase the number of
sequences tested. Thus, the number of sequences tested may be increased by using a plurality of different substrates.
Because the VLSIPS apparatus is almost fully automated, increasing the number of substrates does not lead to a
significant increase in the number of manipulations which must be performed by humans. This again leads to greater
reproducibility and speed in the handling of these multiple substrates.
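
The number of substrates needed follows from simple arithmetic, sketched below. The figure of 65,536 sites per
substrate is a hypothetical capacity chosen for the example, not a value from this specification.

```python
def substrates_needed(n_sequences: int, sites_per_substrate: int) -> int:
    """Smallest number of substrates that together hold n_sequences sites."""
    if sites_per_substrate <= 0:
        raise ValueError("sites_per_substrate must be positive")
    return -(-n_sequences // sites_per_substrate)  # ceiling division


# All 10-mers over four bases: 4 ** 10 = 1,048,576 sequences.
print(substrates_needed(4 ** 10, 65536))  # 16
```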

A. Instrumentation

The concept of using VLSIPS generally allows a pattern or a matrix of reagents to be generated. The procedure for
making the pattern is performed by any of a number of different methods. An apparatus and instrumentation useful for
generating a high density VLSIPS substrate is described in detail in PCT publication no. WO90/15070 and U.S.S.N.
07/624,120.

B. Binary Masking

The details of the binary masking are described in an accompanying application filed simultaneously with this, 
U.S.S.N. 07/624,120, whose specification is incorporated herein by reference. 

For example, the binary masking technique allows for producing a plurality of sequences based on the selection of
either of two possibilities at any particular location. By a series of binary masking steps, the binary decision may be the
determination, on a particular synthetic cycle, whether or not to add any particular one of the possible subunits. By
treating various regions of the matrix pattern in parallel, the binary masking strategy provides the ability to carry out
spatially addressable parallel synthesis.
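
The binary decision at each synthetic cycle may be pictured with the following minimal sketch. It assumes one mask per
cycle and chooses, for illustration, masks that expose the sites whose index has the corresponding bit set; the actual
mask designs are those of U.S.S.N. 07/624,120 and are not reproduced here. The function names are hypothetical.

```python
from typing import List, Set


def binary_masks(n_cycles: int) -> List[Set[int]]:
    """Mask i exposes every site whose index has bit i set (an illustrative choice)."""
    n_sites = 2 ** n_cycles
    return [{s for s in range(n_sites) if (s >> i) & 1} for i in range(n_cycles)]


def synthesize(monomers: List[str]) -> List[str]:
    """Run one cycle per monomer; exposed sites add the monomer, masked sites do not."""
    masks = binary_masks(len(monomers))
    products = [""] * (2 ** len(monomers))
    for monomer, mask in zip(monomers, masks):
        for site in sorted(mask):
            products[site] += monomer
    return products


print(synthesize(["A", "C", "G"]))
# ['', 'A', 'C', 'AC', 'G', 'AG', 'CG', 'ACG']
```

Three cycles yield eight distinct products, one per site, which is the sense in which the regions are treated in
parallel.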

C. Synthetic Methods

The synthetic methods in making a substrate are described in the parent application, U.S.S.N. 07/492,462. The
construction of the matrix pattern on the substrate will typically be generated by the use of photo-sensitive reagents. By
use of photo-lithographic optical methods, particular segments of the substrate can be irradiated with light to activate or
deactivate blocking agents, e.g., to protect or deprotect particular chemical groups. By an appropriate sequence of
photo-exposure steps at appropriate times with appropriate masks and with appropriate reagents, the substrates can
have known polymers synthesized at positionally defined regions on the substrate. Methods for synthesizing various
substrates are described in PCT publication no. WO90/15070 and U.S.S.N. 07/624,120. By a sequential series of these
photo-exposure and reaction manipulations, a defined matrix pattern of known sequences may be generated, and is
typically referred to as a VLSIPS substrate. In the nucleic acid synthesis embodiment, nucleosides used in the
synthesis of DNA by photolytic methods will typically be one of the two forms shown below:



[Chemical structures I and II (not reproduced): photoprotected nucleosides; B = Adenine, Cytosine, Guanine, or Thymine]



In I, the photolabile group at the 5' position is abbreviated NV (nitroveratryl) and in II, the group is abbreviated
NVOC (nitroveratryloxycarbonyl). Although not shown above, bases (adenine, cytosine, and guanine) contain exocyclic
NH2 groups which must be protected during DNA synthesis. Thymine contains no exocyclic NH2 and therefore requires
no protection. The standard protecting groups for these amines are shown below:



[Chemical structures (not reproduced): standard amide protecting groups for Adenine (A), Cytosine (C), and Guanine (G)]

Other amides of the general formula

[Chemical structure (not reproduced): amide, R-C(=O)-; R = alkyl, aryl]

where R may be alkyl or aryl have been used.

Another type of protecting group, FMOC (9-fluorenylmethoxycarbonyl), is currently being used to protect the
exocyclic amines of the three bases:



[Chemical structures (not reproduced): FMOC-protected Adenine (A), Cytosine (C), and Guanine (G)]



The advantage of the FMOC group is that it is removed under mild conditions (dilute organic bases) and can be
used for all three bases. The amide protecting groups require harsher conditions to be removed (NH3/MeOH with
heat).



Nucleosides used as 5'-OH probes, useful in verifying correct VLSIPS synthetic function, have been the following: 



[Chemical structures of the nucleoside probes (not reproduced)]



These compounds are used to detect where on a substrate photolysis has occurred by the attachment of either III
or IV to the newly generated 5'-OH. In the case of III, after the phosphate attachment is made, the substrate is treated
with a dilute base to remove the FMOC group. The resulting amine can be reacted with FITC and the substrate
examined by fluorescence microscopy. This indicates the proper generation of a 5'-OH. In the case of compound IV, after the
phosphate attachment is made, the substrate is treated with FITC labeled streptavidin and the substrate again may be
examined by fluorescence microscopy. Other probes, although not nucleoside based, have included the following:



[Chemical structures of the non-nucleoside probes (not reproduced)]



The method of attachment of the first nucleoside to the surface of the substrate depends on the functionality of the 
groups at the substrate surface. If the surface is amine functionalized, an amide bond is made (see example below). 



[Chemical structure (not reproduced): thymidine attached to an amine functionalized surface through an amide bond]



If the surface is hydroxy functionalized, a phosphate bond is made (see example below).

[Chemical structure (not reproduced): NV-protected thymidine attached to a hydroxy functionalized surface through a phosphate bond]

In both cases, the thymidine example is illustrated, but any one of the four phosphoramidite activated nucleosides 
can be used in the first step. 

Photolysis of the photolabile group NV or NVOC on the 5' positions of the nucleosides is carried out at ~362 nm
with an intensity of 14 mW/cm² for 10 minutes with the substrate side (side containing the photolabile group) immersed
in dioxane. After the coupling of the next nucleoside is complete, the photolysis is repeated followed by another coupling
until the desired oligomer is obtained.
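
As a worked check of the stated conditions, the energy delivered per photolysis step is the intensity multiplied by
the exposure time. The sketch below performs that conversion; the number of photolysis steps used in the example is a
hypothetical value, since the specification does not fix an oligomer length here.

```python
def exposure_dose_j_per_cm2(intensity_mw_per_cm2: float, minutes: float) -> float:
    """Energy per unit area: intensity (mW/cm^2) times time (s), converted to J/cm^2."""
    return intensity_mw_per_cm2 * minutes * 60.0 / 1000.0


def total_photolysis_minutes(n_photolysis_steps: int, minutes_per_step: float = 10.0) -> float:
    """Total irradiation time when each photolysis step lasts minutes_per_step."""
    return n_photolysis_steps * minutes_per_step


# 14 mW/cm^2 for 10 minutes, as stated above.
print(round(exposure_dose_j_per_cm2(14.0, 10.0), 3))  # 8.4

# Hypothetical synthesis with 8 photolysis steps.
print(total_photolysis_minutes(8))  # 80.0
```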

One of the most common 3'-O-protecting groups is the ester, in particular the acetate



[Chemical structure (not reproduced): 3'-O-ester protected nucleoside; R = CH3, C6H5]

These groups can be removed by mild base treatment, 0.1 N NaOH/MeOH or K2CO3/H2O/MeOH.
Another group used most often is the silyl ether.

[Chemical structure (not reproduced): 3'-O-silyl ether; R1, R2 = CH3; R3 = tBu]



These groups can be removed under neutral conditions using 1 M tetra-n-butylammonium fluoride in THF or under
acid conditions.

Related to photodeprotection, the nitroveratryl group could also be used to protect the 3'-position.

Here, light (photolysis) would be used to remove these protecting groups.

A variety of ethers can also be used in the protection of the 3'-O-position.

[Chemical structure (not reproduced): 3'-O-ether protected nucleoside; R = trityl, benzyl]

Removal of these groups usually involves acid or catalytic methods.

Note that corresponding linkages and photoblocked amino acids are described in detail in U.S.S.N. 07/624,120,
which is hereby incorporated herein by reference.

Although the specificity of interactions at particular locations will usually be homogeneous due to a homogeneous
polymer being synthesized at each defined location, for certain purposes, it may be useful to have mixed polymers with
a commensurate mixed collection of interactions occurring at specific defined locations, or degeneracy reducing
analogues, which have been discussed above and show broad specificity in binding. Then, a positive interaction signal may
result from any of a number of sequences contained therein.

As an alternative method of generating a matrix pattern on a substrate, preformed polymers may be individually
attached at particular sites on the substrate. This may be performed by individually attaching reagents one at a time to
specific positions on the matrix, a process which may be automated. See, e.g., U.S.S.N. 07/435,316 (caged biotin
parent), and U.S.S.N. 07/612,671 (caged biotin CIP). Another way of generating a positionally defined matrix pattern on a
substrate is to have individually specific reagents which interact with each specific position on the substrate. For
example, oligonucleotides may be synthesized at defined locations on the substrate. Then the substrate would have on its
surface a plurality of regions having homogeneous oligonucleotides attached at each position.

In particular, at least four different substrate preparation procedures are available for treating a substrate surface.
They are the standard VLSIPS method, polymeric substrates, Durapore™, and synthetic beads or fibers. The treatment
labeled "standard VLSIPS" method is described in U.S.S.N. 07/624,120, and involves applying
aminopropyltriethoxysilane to a glass surface.

The polymeric substrate approach involves either of two ways of generating a polymeric substrate. The first uses a
high concentration of aminopropyltriethoxysilane (2-20%) in an aqueous ethanol solution (95%). This allows the silane
compound to polymerize both in solution and on the substrate surface, which provides a high density of amines on the
surface of the glass. This density is contrasted with the standard VLSIPS method. This polymeric method allows for the
deposition on the substrate surface of a monolayer due to the anhydrous method used with the aforementioned silane.

The second polymeric method involves either the coating or covalent binding of an appropriate acrylic acid polymer
onto the substrate surface. In particular, e.g., in DNA synthesis, a monomer such as a hydroxypropylacrylate is used to
generate a high density of hydroxyl groups on the substrate surface, allowing for the formation of phosphate bonds. An
example of such a compound is shown:

[Chemical structure (not reproduced): example compound bearing an Si(OCH2CH3)3 group]



The method using a Durapore™ membrane (Millipore) consists of a polyvinylidene difluoride coating with
crosslinked polyhydroxylpropyl acrylate [PVDF-HPA]:



[Chemical structure (not reproduced): PVDF-HPA, bearing free OH groups]



Here the building up of, e.g., a DNA oligomer, can be started immediately since phosphate bonds to the surface can be
accomplished in the first step with no need for modification. A nucleotide dimer (5'-C-T-3') has been successfully made
on this substrate in our labs.

The fourth method utilizes synthetic beads or fibers. This would use another substrate, such as a teflon copolymer
graft bead or fiber, which is covalently coated with an organic layer (hydrophilic) terminating in hydroxyl sites
(commercially available from Molecular Biosystems, Inc.). This would offer the same advantage as the Durapore™ membrane,
allowing for immediate phosphate linkages, but would give additional contour by the 3-dimensional growth of oligomers.

A matrix pattern of new reagents may be targeted to each specific oligonucleotide position by attaching to each
reagent an oligonucleotide complementary to the substrate bound form. For instance, a number of regions may
have homogeneous oligonucleotides synthesized at various locations. Oligonucleotide sequences complementary to
each of these can be individually generated and linked to particular specific reagents. Often these specific reagents
will be antibodies. As each of these is specific for finding its complementary oligonucleotide, each of the specific
reagents will bind through the oligonucleotide to the appropriate matrix position. A single step having a combination of
different specific reagents being attached specifically to a particular oligonucleotide will thereby bind to its complement at
the defined matrix position. The oligonucleotides will typically then be covalently attached, using, e.g., an acridine dye,
for photocrosslinking. Psoralen is a commonly used acridine dye for photocrosslinking purposes; see, e.g., Song et al.
(1979) Photochem. Photobiol. 29:1177-1197; Cimino et al. (1985) Ann. Rev. Biochem. 54:1151-1193; Parsons (1980)
Photochem. Photobiol. 32:813-821; and Dattagupta et al. (1985) U.S. Pat. No. 4,542,102, and (1987) U.S. Pat. No.
4,713,326; each of which is hereby incorporated herein by reference. This method allows a single attachment
manipulation to attach all of the specific reagents to the matrix at defined positions and results in the specific reagents being
homogeneously located at defined positions.

D. Surface Immobilization

1. caged biotin

An alternative method of attaching reagents in a positionally defined matrix pattern is to use a caged biotin system.
See U.S.S.N. 07/612,671 (caged biotin CIP), which is hereby incorporated herein by reference, for additional details on
the chemistry and application of caged biotin embodiments. In short, the caged biotin has a photosensitive blocking
moiety which prevents the combination of avidin to biotin. At positions where the photolithographic process has
removed the blocking group, high affinity biotin sites are generated. Thus, by a sequential series of photolithographic
deblocking steps interspersed with exposure of those regions to appropriate biotin containing reagents, only those
locations where the deblocking takes place will form an avidin-biotin interaction. Because the avidin-biotin binding is
very tight, this will usually be virtually irreversible binding.

2. crosslinked interactions 

The surface immobilization may also take place by photocrosslinking of defined oligonucleotides linked to specific
reagents. After hybridization of the complementary oligonucleotides, the oligonucleotides may be crosslinked by a
reagent, e.g., psoralen or another similar type of acridine dye. Other useful crosslinking reagents are described in
Dattagupta et al. (1985) U.S. Pat. No. 4,542,102, and (1987) U.S. Pat. No. 4,713,326.

In another embodiment, colonies or phage plaques containing biological polymers may be transferred directly onto a
silicon substrate. For example, a colony plate may be transferred onto a substrate having a generic oligonucleotide
sequence which hybridizes to another generic complementary sequence contained on all of the vectors into which
inserts are cloned. This will specifically bind only those molecules which are actually contained in the vectors
containing the desired complementary sequence. This immobilization allows for producing a matrix onto which a sequence
specific reagent can bind, or for other purposes. In a further embodiment, a plurality of different vectors each having a
specific oligonucleotide attached to the vector may be specifically attached to particular regions on a matrix having a
complementary oligonucleotide attached thereto.

VIII. HYBRIDIZATION/SPECIFIC INTERACTION 
A. General 

As discussed previously in the VLSIPS parent applications, the VLSIPS substrates may be used for screening for
specific interactions with sequence specific targets or probes.

In addition, the availability of substrates having the entire repertoire of possible sequences of a defined length
opens up the possibility of sequencing by hybridization. This sequencing may be de novo determination of an unknown
sequence, particularly of nucleic acid, verification of a sequence determined by another method, or an investigation of
changes in a previously sequenced gene, locating and identifying specific changes. For example, often Maxam and
Gilbert sequencing techniques are applied to sequences which have been determined by Sanger and Coulson. Each of
those sequencing technologies has problems with resolving particular types of sequences. Sequencing by hybridization
may serve as a third and independent method for verifying other sequencing techniques. See, e.g., (1988) Science
242:1245.

In addition, the ability to provide a large repertoire of particular sequences allows use of short subsequences and
hybridization as a means to fingerprint a polynucleotide sample. For example, fingerprinting to a high degree of
specificity of sequence matching may be used for identifying highly similar samples, e.g., those exhibiting high
homology to the selected probes. This may provide a means for determining classifications of particular sequences. This
should allow determination of whether particular genomes of bacteria, phage, or even higher cells might be related to
one another.
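
As a rough illustration of fingerprinting by short subsequences, the following sketch records which probes of a small
panel are found in each sample and compares two samples by the overlap of their probe sets. It is a simplification
made for illustration only: exact substring matching stands in for hybridization (no complementary strands, no
mismatches), and the function names and the Jaccard index used as the similarity score are assumptions, not part of
the described method.

```python
def fingerprint(sample: str, probes: list[str]) -> frozenset[str]:
    """Return the set of probes that occur in the sample.

    Exact substring matching is a stand-in for hybridization (assumption).
    """
    return frozenset(p for p in probes if p in sample)


def similarity(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard index of two fingerprints: shared probes / all probes seen."""
    union = a | b
    if not union:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(union)


# Hypothetical probe panel and samples, for illustration only.
probes = ["ACG", "CGT", "TTT", "GGA"]
fp1 = fingerprint("ACGTTT", probes)
fp2 = fingerprint("ACGGA", probes)
score = similarity(fp1, fp2)
```

A score near 1 suggests closely related samples under this simplified model; the threshold for calling two samples
related would have to be chosen for the application.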

In addition, fingerprinting may be used to identify an individual source of biological sample. See, e.g., Lander, E.
(1989) Nature 339:501-505, and references therein. For example, a DNA fingerprint may be used to determine whether
a genetic sample arose from another individual. This would be particularly useful in various sorts of forensic tests to
determine, e.g., paternity or sources of blood samples. Significant detail on the particulars of genetic fingerprinting
for identification purposes is described in, e.g., Morris et al. (1989) "Biostatistical evaluation of evidence from
continuous allele frequency distribution DNA probes in reference to disputed paternity and identity," J. Forensic
Science 34:1311-1317; and Neufeld et al. (1990) Scientific American 262:46-53; each of which is hereby incorporated
herein by reference.

In another embodiment, a fingerprinting-like procedure may be used for classifying cell types by analyzing a pattern
of specific nucleic acids present in the cell, specifically RNA expression patterns. This may also be useful in defining
the temporal stage of development of cells, e.g., stem cells or other cells which undergo temporal changes in
development. For example, the stage of a cell, or group of cells, may be tested or defined by isolating a sample of mRNA
from the population and testing to see what sequences are present in messenger populations. Direct samples, or amplified
samples (e.g., by polymerase chain reaction), may be used. Where particular mRNA or other nucleic acid sequences
are characteristic of, or shown to be characteristic of, particular developmental stages, physiological states, or other
conditions, this fingerprinting method may define them.

The present invention may also be used for mapping sequences within a larger segment. This may be performed
by at least two methods, particularly in reference to nucleic acids. Often, enormous segments of DNA are subcloned
into a large plurality of subsequences. Ordering these subsequences may be important in determining the overlaps of
sequences upon nucleotide determinations. Mapping may be performed by immobilizing particularly large segments
onto a matrix using the VLSIPS technology. Alternatively, sequences may be ordered by virtue of subsequences shared
by overlapping segments. See, e.g., Craig et al. (1990) Nuc. Acids Res. 18:2653-2660; Michiels et al. (1987) CABIOS
3:203-210; and Olson et al. (1986) Proc. Natl. Acad. Sci. USA 83:7826-7830.

B. Important Parameters 

The extent of specific interaction between reagents immobilized to the VLSIPS substrate and another sequence
specific reagent may be modified by the conditions of the interaction. Sequencing embodiments typically require high
fidelity hybridization and the ability to discriminate perfect matching from imperfect matching. Fingerprinting and
mapping embodiments may be performed using less stringent conditions, or in some embodiments very highly stringent
conditions, depending upon the circumstances.

In a nucleic acid hybridization embodiment, the specificity and kinetics of hybridization have been described in
detail by, e.g., Wetmur and Davidson (1968) J. Mol. Biol. 31:349-370, Britten and Kohne (1968) Science 161:529-540,
and Kanehisa (1984) Nuc. Acids Res. 12:203-213, each of which is hereby incorporated herein by reference.
Parameters which are well known to affect specificity and kinetics of reaction include salt conditions, ionic
composition of the solvent, hybridization temperature, length of oligonucleotide matching sequences, guanine and
cytosine (GC) content, presence of hybridization accelerators, pH, specific bases found in the matching sequences,
solvent conditions, and addition of organic solvents.

In particular, the salt conditions required for driving highly mismatched sequences to completion typically include a
high salt concentration. The typical salt used is sodium chloride (NaCl); however, other ionic salts may be utilized,
e.g., KCl. Depending on the desired stringency of hybridization, the salt concentration will often be less than about 3
molar, more often less than 2.5 molar, usually less than about 2 molar, and more usually less than about 1.5 molar. For
applications directed towards higher stringency matching, the salt concentrations would typically be lower. Ordinary
high stringency conditions will utilize salt concentration of less than about 1 molar, more often less than about 750
millimolar, usually less than about 500 millimolar, and may be as low as about 250 or 150 millimolar.

The kinetics of hybridization and the stringency of hybridization both depend upon the temperature at which the
hybridization is performed and the temperature at which the washing steps are performed. Temperatures at which steps
for low stringency hybridization are desired would typically be lower temperatures, e.g., ordinarily at least about 15°C,
more ordinarily at least about 20°C, usually at least about 25°C, and more usually at least about 30°C. For those
applications requiring high stringency hybridization, or fidelity of hybridization and sequence matching, temperatures
at which hybridization and washing steps are performed would typically be high. For example, temperatures in excess of
about 35°C would often be used, more often in excess of about 40°C, usually at least about 45°C, and occasionally
even temperatures as high as about 50°C or 60°C or more. Of course, the hybridization of oligonucleotides may be
disrupted by even higher temperatures. Thus, for stripping of targets from substrates, as discussed below, temperatures
as high as 80°C, or even higher, may be used.

The base composition of the specific oligonucleotides involved in hybridization affects the temperature of melting,
and the stability of hybridization as discussed in the above references. However, the bias of GC rich sequences to
hybridize faster and retain stability at higher temperatures can be compensated for by the inclusion in the hybridization
incubation or wash steps of various buffers. Sample buffers which accomplish this result include the triethyl- and
trimethyl ammonium buffers. See, e.g., Wood et al. (1987) Proc. Natl. Acad. Sci. USA 82:1585-1588, and Khrapko, K. et al.
(1989) FEBS Letters 256:118-122.
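
The GC bias described above can be made concrete with a common textbook rule of thumb for short oligonucleotides in
ordinary (non-compensating) high salt buffer, in which each A or T contributes about 2°C and each G or C about 4°C to
the melting temperature. This rule is not taken from the present disclosure; it is introduced here only as an assumed
approximation to show why probes of equal length can melt at quite different temperatures, and the function name is
hypothetical.

```python
def wallace_tm(oligo: str) -> int:
    """Approximate melting temperature (degrees C) of a short oligonucleotide.

    Uses the rule of thumb Tm = 2*(A+T) + 4*(G+C), an assumed approximation
    that is only reasonable for short probes in ordinary high salt buffers.
    """
    seq = oligo.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc != len(seq):
        raise ValueError("oligo must contain only A, C, G and T")
    return 2 * at + 4 * gc


# Two 8-mers of equal length but opposite base composition.
tm_at_rich = wallace_tm("AATTAATT")
tm_gc_rich = wallace_tm("GGCCGGCC")
```

Under this approximation the GC rich 8-mer melts about 16°C above the AT rich one, which is the kind of spread that
the alkyl ammonium buffers cited above are intended to reduce.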

The rate of hybridization can also be affected by the inclusion of particular hybridization accelerators. These
hybridization accelerators include the volume exclusion agents characterized by dextran sulfate, or polyethylene glycol
(PEG). Dextran sulfate is typically included at a concentration of between 1% and 40% by weight. The actual concentration
selected depends upon the application, but typically a faster hybridization is desired in which the concentration is
optimized for the system in question. Dextran sulfate is often included at a concentration of between 0.5% and 2% by
weight or dextran sulfate at a concentration between about 0.5% and 5%. Alternatively, proteins which accelerate
hybridization may be added, e.g., the recA protein found in E. coli, or other homologous proteins.

Of course, the specific hybridization conditions will be selected to correspond to a discriminatory condition which
provides a positive signal where desired but fails to show a positive signal at affinities where interaction is not
desired. This may be determined by a number of titration steps or with a number of controls which will be run during the
hybridization and/or washing steps to determine at what point the hybridization conditions have reached the stage of
desired specificity.

IX. DETECTION METHODS 

Methods for detection depend upon the label selected. The criteria for selecting an appropriate label are discussed
below; however, a fluorescent label is preferred because of its extreme sensitivity and simplicity. Standard labeling
procedures are used to determine the positions where interactions between a sequence and a reagent take place. For
example, if a target sequence is labeled and exposed to a matrix of different probes, only those locations where probes
do interact with the target will exhibit any signal. Alternatively, other methods may be used to scan the matrix to
determine where interaction takes place. Of course, the spectrum of interactions may be determined in a temporal manner
by repeated scans of interactions which occur at each of a multiplicity of conditions. However, instead of testing each
individual interaction separately, a multiplicity of sequence interactions may be simultaneously determined on a matrix.

A. Labeling Techniques

The target polynucleotide may be labeled by any of a number of convenient detectable markers. A fluorescent label
is preferred because it provides a very strong signal with low background. It is also optically detectable at high
resolution and sensitivity through a quick scanning procedure. Other potential labeling moieties include radioisotopes,
chemiluminescent compounds, labeled binding proteins, heavy metal atoms, spectroscopic markers, magnetic labels, and
linked enzymes.

Another method for labeling does not require incorporation of a labeling moiety. The target may be exposed to the
probes, and a double strand hybrid is formed at those positions only. Addition of a double strand specific reagent will
detect where hybridization takes place. An intercalative dye such as ethidium bromide may be used as long as the
probes themselves do not fold back on themselves to a significant extent forming hairpin loops. See, e.g., Sheldon et
al. (1986) U.S. Pat. No. 4,582,789. However, the length of the hairpin loops in short oligonucleotide probes would
typically be insufficient to form a stable duplex.

In another embodiment, different targets may be simultaneously sequenced where each target has a different label.
For instance, one target could have a green fluorescent label and a second target could have a red fluorescent label.
The scanning step will distinguish sites of binding of the red label from those binding the green fluorescent label. Each
sequence can be analyzed independently from one another.

Suitable chromogens will include molecules and compounds which absorb light in a distinctive range of wavelengths
so that a color may be observed, or emit light when irradiated with radiation of a particular wavelength or wavelength
range, e.g., fluorescers.

A wide variety of suitable dyes are available, being primarily chosen to provide an intense color with minimal
absorption by their surroundings. Illustrative dye types include quinoline dyes, triarylmethane dyes, acridine dyes,
alizarine dyes, phthaleins, insect dyes, azo dyes, anthraquinoid dyes, cyanine dyes, phenazathionium dyes, and
phenazoxonium dyes.

A wide variety of fluorescers may be employed either by themselves or in conjunction with quencher molecules.
Fluorescers of interest fall into a variety of categories having certain primary functionalities. These primary
functionalities include 1- and 2-aminonaphthalene, p,p'-diaminostilbenes, pyrenes, quaternary phenanthridine salts,
9-aminoacridines, p,p'-diaminobenzophenone imines, anthracenes, oxacarbocyanine, merocyanine, 3-aminoequilenin, perylene,
bis-benzoxazole, bis-p-oxazolyl benzene, 1,2-benzophenazin, retinol, bis-3-aminopyridinium salts, hellebrigenin,
tetracycline, sterophenol, benzimidazolylphenylamine, 2-oxo-3-chromen, indole, xanthen, 7-hydroxycoumarin,
phenoxazine, salicylate, strophanthidin, porphyrins, triarylmethanes and flavin. Individual fluorescent compounds which
have functionalities for linking or which can be modified to incorporate such functionalities include, e.g., dansyl
chloride; fluoresceins such as 3,6-dihydroxy-9-phenylxanthhydrol; rhodamineisothiocyanate; N-phenyl
1-amino-8-sulfonatonaphthalene; N-phenyl 2-amino-6-sulfonatonaphthalene;
4-acetamido-4-isothiocyanato-stilbene-2,2'-disulfonic acid; pyrene-3-sulfonic acid; 2-toluidinonaphthalene-6-sulfonate;
N-phenyl, N-methyl 2-aminonaphthalene-6-sulfonate; ethidium bromide; stebrine; auromine-O; 2-(9'-anthroyl)palmitate;
dansyl phosphatidylethanolamine; N,N'-dioctadecyl oxacarbocyanine; N,N'-dihexyl oxacarbocyanine; merocyanine;
4-(3'-pyrenyl)butyrate; d-3-aminodesoxy-equilenin; 12-(9'-anthroyl)stearate; 2-methylanthracene; 9-vinylanthracene;
2,2'-(vinylene-p-phenylene)bisbenzoxazole; p-bis[2-(4-methyl-5-phenyl-oxazolyl)]benzene;
6-dimethylamino-1,2-benzophenazin; retinol; bis(3'-aminopyridinium) 1,10-decandiyl diiodide; sulfonaphthylhydrazone of
hellebrigenin; chlorotetracycline; N-(7-dimethylamino-4-methyl-2-oxo-3-chromenyl)maleimide;
N-[p-(2-benzimidazolyl)-phenyl]maleimide; N-(4-fluoranthyl)maleimide; bis(homovanillic acid); resazarin;
4-chloro-7-nitro-2,1,3-benzooxadiazole; merocyanine 540; resorufin; rose bengal; and 2,4-diphenyl-3(2H)-furanone.

Desirably, fluorescers should absorb light above about 300 nm, preferably above about 350 nm, and more preferably
above about 400 nm, usually emitting at wavelengths greater than about 10 nm higher than the wavelength of the light
absorbed. It should be noted that the absorption and emission characteristics of the bound dye may differ from the
unbound dye. Therefore, when referring to the various wavelength ranges and characteristics of the dyes, it is intended
to indicate the dyes as employed and not the dye which is unconjugated and characterized in an arbitrary solvent.

Fluorescers are generally preferred because by irradiating a fluorescer with light, one can obtain a plurality of
emissions. Thus, a single label can provide for a plurality of measurable events.

Detectable signal may also be provided by chemiluminescent and bioluminescent sources. Chemiluminescent
sources include a compound which becomes electronically excited by a chemical reaction and may then emit light
which serves as the detectable signal or donates energy to a fluorescent acceptor. A diverse number of families of
compounds have been found to provide chemiluminescence under a variety of conditions. One family of compounds is
2,3-dihydro-1,4-phthalazinedione. The most popular compound is luminol, which is the 5-amino compound. Other
members of the family include the 5-amino-6,7,8-trimethoxy- and the dimethylamino[ca]benz analog. These compounds can
be made to luminesce with alkaline hydrogen peroxide or calcium hypochlorite and base. Another family of compounds
is the 2,4,5-triphenylimidazoles, with lophine as the common name for the parent product. Chemiluminescent analogs
include para-dimethylamino and -methoxy substituents. Chemiluminescence may also be obtained with oxalates,
usually oxalyl active esters, e.g., p-nitrophenyl, and a peroxide, e.g., hydrogen peroxide, under basic conditions.
Alternatively, luciferins may be used in conjunction with luciferase or lucigenins to provide bioluminescence.

Spin labels are provided by reporter molecules with an unpaired electron spin which can be detected by electron
spin resonance (ESR) spectroscopy. Exemplary spin labels include organic free radicals, transition metal complexes,
particularly vanadium, copper, iron, and manganese, and the like. Exemplary spin labels include nitroxide free radicals.

B. Scanning System 

With the automated detection apparatus, the correlation of specific positional labeling is converted to the presence
on the target of sequences for which the reagents have specificity of interaction. Thus, the positional information is
directly converted to a database indicating what sequence interactions have occurred. For example, in a nucleic acid
hybridization application, the sequences which have interacted between the substrate matrix and the target molecule
can be directly listed from the positional information. The detection system used is described in PCT publication no.
WO90/15070; and U.S.S.N. 07/624,120. Although the detection described therein is a fluorescence detector, the
detector may be replaced by a spectroscopic or other detector. The scanning system may make use of a moving detector
relative to a fixed substrate, a fixed detector with a moving substrate, or a combination. Alternatively, mirrors or other
apparatus can be used to transfer the signal directly to the detector. See, e.g., U.S.S.N. 07/624,120, which is hereby
incorporated herein by reference.

The detection method will typically also incorporate some signal processing to determine whether the signal at a
particular matrix position is a true positive or may be a spurious signal. For example, a signal from a region which has
actual positive signal may tend to spread over and provide a positive signal in an adjacent region which actually should
not have one. This may occur, e.g., where the scanning system is not properly discriminating with sufficiently high
resolution in its pixel density to separate the two regions. Thus, the signal over the spatial region may be evaluated
pixel by pixel to determine the locations and the actual extent of positive signal. A true positive signal should, in
theory, show a uniform signal at each pixel location. Thus, processing by plotting number of pixels with actual signal
intensity should have a clearly uniform signal intensity. Regions where the signal intensities show a fairly wide
dispersion may be particularly suspect and the scanning system may be programmed to more carefully scan those positions.
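
The pixel by pixel evaluation described above might be sketched as follows. The sketch assumes that each matrix
region has already been reduced to a list of pixel intensities, and it uses the coefficient of variation (standard
deviation divided by mean) as the measure of dispersion. The function name, the background level, and both thresholds
are illustrative assumptions rather than values taken from the detection system referenced above.

```python
from statistics import mean, pstdev


def classify_region(pixels, background=10.0, min_ratio=3.0, max_cv=0.25):
    """Classify one matrix region from its pixel intensities.

    Returns "negative" if the mean signal is below min_ratio * background,
    "positive" if the signal is strong and uniform across pixels, and
    "suspect" if it is strong but widely dispersed (candidate for rescanning).
    All thresholds are assumed values chosen for illustration.
    """
    if background <= 0:
        raise ValueError("background must be positive")
    avg = mean(pixels)
    if avg < min_ratio * background:
        return "negative"
    cv = pstdev(pixels) / avg  # dispersion relative to the mean signal
    return "positive" if cv <= max_cv else "suspect"


uniform = classify_region([100, 102, 98, 101])
weak = classify_region([5, 6, 4, 5])
bleed = classify_region([200, 10, 180, 12])  # e.g. signal spilling in from a neighbor
```

Regions returned as "suspect" correspond to the positions that the scanning system may be programmed to scan more
carefully.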

In another embodiment, as the sequence of a target is determined at a particular location, the overlap for the
sequence would necessarily have a known sequence. Thus, the system can compare the possibilities for the next
adjacent position and look at these in comparison with each other. Typically, only one of the possible adjacent sequences
should give a positive signal and the system might be programmed to compare each of these possibilities and select
that one which gives a strong positive. In this way, the system can also simultaneously provide some means of
measuring the reliability of the determination by indicating what the average signal to background ratio actually is.
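
A minimal sketch of this comparison, assuming the probe signals are available as a mapping from probe sequence to
measured intensity, is given below. The already determined overlap is extended by each of the four possible bases, the
candidate with the strongest signal is selected, and its signal to background ratio is reported as a crude reliability
measure. The function name, the data layout, and the background value are hypothetical.

```python
def call_next_base(known_overlap, signal, background=1.0):
    """Pick the next base by comparing the four candidate probes.

    known_overlap: the already determined bases shared with the next probe.
    signal: dict mapping probe sequence -> measured intensity (assumed layout).
    Returns (best_base, signal_to_background_ratio).
    """
    if background <= 0:
        raise ValueError("background must be positive")
    # Intensity of each candidate probe; probes with no reading count as zero.
    candidates = {base: signal.get(known_overlap + base, 0.0) for base in "ACGT"}
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / background


# Hypothetical readings for 4-mer probes sharing the known overlap "ACG".
readings = {"ACGT": 50.0, "ACGA": 2.0}
base, ratio = call_next_base("ACG", readings, background=2.0)
```

A low ratio, or two candidates with comparable intensity, would indicate that the determination at that position is
unreliable.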

More sophisticated signal processing techniques can be applied to the initial determination of whether a positive
signal exists or not. See, e.g., U.S.S.N. 07/624,120.

From a listing of those sequences which interact, data analysis may be performed on a series of sequences. For
example, in a nucleic acid sequence application, each of the sequences may be analyzed for their overlap regions and
the original target sequence may be reconstructed from the collection of specific subsequences obtained therein. Other
sorts of analyses for different applications may also be performed, and because the scanning system directly interfaces
with a computer the information need not be transferred manually. This provides for the ability to handle large amounts
of data with very little human intervention. This, of course, provides significant advantages over manual manipulations.
Increased throughput and reproducibility are thereby provided by the automation of the vast majority of steps in any of
these applications.

X. DATA ANALYSIS

A. General

Data analysis will typically involve aligning the proper sequences with their overlaps to determine the target 
sequence. Although the target "sequence" may not specifically correspond to any specific molecule, especially where 
the target sequence is broken and fragmented up in the sequencing process, the sequence corresponds to a contigu- 
ous sequence of the subfragments. 

The data analysis can be performed by a computer using an appropriate program. See, e.g., Drmanac, R. et al.
(1989) Genomics 4:114-128; and a commercially available analysis program available from the Genetic Engineering
Center, P.O. Box 794, 11000 Belgrade, Yugoslavia. Although the specific manipulations necessary to reassemble the
target sequence from fragments may take many forms, one embodiment uses a sorting program to sort all of the
subsequences using a defined hierarchy. The hierarchy need not necessarily correspond to any physical hierarchy, but
provides a means to determine, in order, which subfragments have actually been found in the target sequence. In this
manner, overlaps can be checked and found directly rather than having to search throughout the entire set after each
selection process. For example, where the oligonucleotide probes are 10-mers, the first 9 positions can be sorted. A
particular subsequence can be selected, as in the examples, to determine where the process starts. As analogous to
the theoretical example provided above, the sorting procedure provides the ability to immediately find the position of the
subsequence which contains the first 9 positions and to check whether more than one subsequence shares those
first 9 positions. In fact, the computer can easily generate all of the possible target sequences which contain a
given combination of subsequences. Typically there will be only one, but in various situations, there will be more.
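
The sorting approach described above can be sketched as follows. The detected subsequences are sorted once, so that
all subsequences beginning with a given overlap (the first k-1 positions) are adjacent and can be located by binary
search rather than by scanning the whole set. Starting from a chosen subsequence, every chain that uses each detected
subsequence exactly once is enumerated. This is an illustrative sketch only, not the cited commercial program; it
assumes each subsequence occurs once in the target, and the function names are hypothetical.

```python
from bisect import bisect_left


def successors(sorted_kmers, kmer):
    """Return all k-mers whose first k-1 bases equal the last k-1 bases of kmer."""
    overlap = kmer[1:]
    i = bisect_left(sorted_kmers, overlap)  # first entry not less than the overlap
    found = []
    while i < len(sorted_kmers) and sorted_kmers[i].startswith(overlap):
        found.append(sorted_kmers[i])
        i += 1
    return found


def reconstruct(kmers, start):
    """Enumerate every target sequence that begins with start and uses each
    detected k-mer exactly once (assumes no k-mer repeats in the target)."""
    ordered = sorted(set(kmers))
    if start not in ordered:
        raise ValueError("start must be one of the detected subsequences")
    results = []

    def extend(path, used):
        if len(used) == len(ordered):
            # First k-mer in full, then the last base of each following k-mer.
            results.append(path[0] + "".join(k[-1] for k in path[1:]))
            return
        for nxt in successors(ordered, path[-1]):
            if nxt not in used:
                extend(path + [nxt], used | {nxt})

    extend([start], {start})
    return results


# Hypothetical 4-mer data derived from the target ACGTTGCA.
detected = ["TTGC", "ACGT", "TGCA", "GTTG", "CGTT"]
targets = reconstruct(detected, "ACGT")
```

When the list returned contains more than one sequence, the subsequence data alone do not determine the target
uniquely and additional information is needed to choose among the candidates.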

An exemplary flow chart for a sequencing program is provided in Figure 4. In general terms, the program provides
for automated scanning of the substrate to determine the positions of probe and target interaction. Simple processing
of the intensity of the signal may be incorporated to filter out clearly spurious signals. The positions with positive
interaction are correlated with the sequence specificity of specific matrix positions, to generate the set of matching
subsequences. This information is further correlated with other target sequence information, e.g., restriction fragment
analysis. The sequences are then aligned using overlap data, thereby leading to possible corresponding target
sequences which will, optimally, correspond to a single target sequence.
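
The filtering and correlation steps of such a program might look like the sketch below, which covers only the step
from scanned intensities to the set of matching subsequences. It assumes the scan is available as a mapping from
matrix position to intensity and that the probe layout is a mapping from position to probe sequence; the names, the
data layout, and the single fixed threshold are assumptions and are not taken from Figure 4.

```python
def matching_subsequences(intensity, layout, threshold):
    """Correlate positive matrix positions with their probe sequences.

    intensity: dict mapping (row, col) -> scanned signal intensity.
    layout:    dict mapping (row, col) -> probe sequence synthesized there.
    threshold: minimum intensity treated as a positive interaction (assumed).
    Returns the sorted list of probe sequences at positive positions.
    """
    return sorted(
        layout[pos]
        for pos, value in intensity.items()
        if value >= threshold and pos in layout
    )


# Hypothetical 2 x 2 corner of a probe matrix and its scan.
layout = {(0, 0): "ACGT", (0, 1): "CGTT", (1, 0): "GGGG"}
scan = {(0, 0): 90.0, (0, 1): 75.0, (1, 0): 3.0}
positives = matching_subsequences(scan, layout, threshold=20.0)
```

The resulting list is the input to the alignment step, in which the subsequences are ordered by their overlaps.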

B. Hardware 

A variety of computer systems may be used to run a sequencing program. The program may be written to provide
both the detecting and scanning steps together and will typically be dedicated to a particular scanning apparatus.
However, the components and functional steps may be separated and the scanning system may provide an output, e.g.,
through tape or an electronic connection, into a separate computer which separately runs the sequencing analysis
program. The computer may be any of a number of machines provided by standard computer manufacturers, e.g., IBM
compatible machines, Apple™ machines, VAX machines, and others, which may often use a UNIX™ operating system.
Alternatively, custom computing architectures may be employed; these architectures may include neural network
methods implemented in hardware and/or software. Of course, the hardware used to run the analysis program will typically
determine what programming language would be used.

C. Software

Software would be readily developed by a person of ordinary skill in the programming art, following the flow chart
provided, or based upon the input provided and the desired result.

Of course, an exemplary embodiment is a polynucleotide sequence system. However, the theoretical and mathe- 
matical manipulations necessary for data analysis of other linear molecules are conceptually similar. 

XI. SUBSTRATE REUSE

Where a substrate is made with specific reagents that are relatively insensitive to the handling and processing
steps involved in a single cycle of use, the substrate may often be reused. The target molecules are usually stripped off
of the solid phase specific recognition molecules. Of course, it is preferred that the manipulations and conditions be
selected so as to be mild and to not affect the substrate. For example, if a substrate is acid labile, a neutral pH would
be preferred in all handling steps. Similar sensitivities would be carefully respected where recycling is desired.

A. Removal of Label

Typically for recycling, the previously attached specific interaction would be disrupted and removed. This will
typically involve exposing the substrate to conditions under which the interaction between probe and target is disrupted.
Alternatively, it may be exposed to conditions where the target is destroyed. For example, where the probes are
oligonucleotides and the target is a polynucleotide, a heating and low salt wash will often be sufficient to disrupt the
interactions. Additional reagents may be added such as detergents, and organic or inorganic solvents which disrupt the
interaction between the specific reagents and target.

B. Storage and Preservation 

As indicated above, the matrix will typically be maintained under conditions where the matrix itself and the linkages
and specific reagents are preserved. Various specific preservatives may be added which prevent degradation. For
example, if the reagents are acid or base labile, a neutral pH buffer will typically be added. It is also desired to avoid
destruction of the matrix by growth of organisms which may destroy organic reagents attached thereto. For this reason,
a preservative such as cyanide or azide may be added. However, the chemical preservative should also be selected to
preserve the chemical nature of the linkages and other components of the substrate. Typically, a detergent may also be
included.

C. Processes to Avoid Degradation of Oligomers 

In particular, a substrate comprising a large number of oligomers will be treated in a fashion which is known to 
maintain the quality and integrity of oligonucleotides. These include storing the substrate in a carefully controlled envi- 
ronment under conditions of lower temperature, cation depletion (EDTA and EGTA), sterile conditions, and inert argon 
or nitrogen atmosphere. 

XII. INTEGRATED SEQUENCING STRATEGY 

A. Initial Mapping Strategy 

As indicated above, although the VLSIPS may be applied to sequencing embodiments, it is often useful to integrate
other concepts to simplify the sequencing. For example, nucleic acids may be easily sequenced by careful selection of
the vectors and hosts used for amplifying and generating the specific target sequences. For example, it may be desired
to use specific vectors which have been designed to interact most efficiently with the VLSIPS substrate. This is also
important in fingerprinting and mapping strategies. For example, vectors may be carefully selected having particular
complementary sequences which are designed to attach to a generic or specific oligomer on the substrate. This is also
applicable to situations where it is desired to target particular sequences to specific locations on the matrix.

In one embodiment, unnatural oligomers may be used to target natural probes to specific locations on the VLSIPS
substrate. In addition, particular probes may be generated for the mapping embodiment which are designed to have
specific combinations of characteristics. For example, the construction of a mapping substrate may depend upon use
of another automated apparatus which takes clones isolated from a chromosome walk and attaches them individually
or in bulk to the VLSIPS substrate.

In another embodiment, a variety of specific vectors having known and particular "targeting" sequences adjacent
the cloning sites may be individually used to clone a selected probe, and the isolated probe will then be targetable to a
site on the VLSIPS substrate with a sequence complementary to the larger sequence.


B. Selection of Smaller Clones 

In the fingerprinting and mapping embodiments, the selection of probes may be very important. Significant
mathematical analysis may be applied to determine which specific sequences should be used as those probes. Of course, for
fingerprinting use, sequences that show significant heterogeneity across the human population would be preferred.
The specific sequences most favorably utilized will tend to be single copy sequences within
the genome, and more specifically single copy sequences that have low cross-hybridization potential to other
sequences in the genome (i.e., not members of a closely-related multigene family).

Various hybridization selection procedures may be applied to select sequences which tend not to be repeated
within a genome, and thus would tend to be conserved across individuals. For example, hybridization selections may
be made for non-repetitive and single copy sequences. See, e.g., Britten and Kohne (1968) "Repeated sequences in
DNA," Science 161:529-540. On the other hand, it may be desired under certain circumstances to use repeated
sequences. For example, where a fingerprint may be used to identify or distinguish different species, or where repetitive
sequences may be diagnostic of specific species, repetitive sequences may be desired for inclusion in the fingerprinting
probes. In either case, the sequencing capability will greatly assist in the selection of appropriate sequences to be used
as probes.

Also as indicated above, various means for constructing an appropriate substrate may involve either mechanical or 
automated procedures. The standard VLSIPS automated procedure involves synthesizing oligonucleotides or short 
polymers directly on the substrate. In various other embodiments, it is possible to attach separately synthesized rea- 
gents onto the matrix in an ordered array. Other circumstances may lend themselves to transfer a pattern from a petri 
plate onto a solid substrate. Also, there are methods for site specifically directing collections of reagents to specific loca- 
tions using unnatural nucleotides or equivalent sorts of targeting molecules. 

While a brute force manual transfer process may be utilized sequentially attaching various samples to successive 
positions, instrumentation for automating such procedures may also be devised. The automated system for performing 
such would preferably be relatively easily designed and conceptually easily understood. 

XIII. COMMERCIAL APPLICATIONS 

A. Sequencing 

As indicated above, sequencing may be performed either de novo or as a verification of another sequencing
method. The present hybridization technology provides the ability to sequence nucleic acids and polynucleotides de
novo, or as a means to verify either the Maxam and Gilbert chemical sequencing technique or Sanger and Coulson
dideoxy-sequencing techniques. The hybridization method is useful to verify sequencing determined by any other
sequencing technique and to closely compare two similar sequences, e.g., to identify and locate sequence differences.

Of course, sequencing can be very important in many different sorts of environments. For example, it will be
useful in determining the genetic sequence of particular markers in various individuals. In addition, polymers may be used
as markers or as information containing molecules to encode information. For example, a short polynucleotide
sequence may be included in large bulk production samples indicating the manufacturer, date, and location of
manufacture of a product. For example, various drugs may be encoded with this information with a small number of molecules
in a batch. For example, a pill may have somewhere from 10 to 100 to 1,000 or more very short and small molecules
encoding this information. When necessary, this information may be decoded from a sample of the material using a
polymerase chain reaction (PCR) or other amplification method. This encoding system may be used to provide the
origin of large bulky samples without significantly affecting the properties of those samples. For example, chemical
samples may also be encoded by this method, thereby providing means for identifying the source and manufacturing details
of lots. The origin of bulk hydrocarbon samples may be encoded. Production lots of organic compounds such as
benzene or plastics may be encoded with a short molecule polymer. Food stuffs may also be encoded using similar marking
molecules. Even toxic waste samples can be encoded to determine the source or origin. In this way, proper disposal can
be traced or more easily enforced.
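
The encoding of manufacturer, date, or lot information in a short polynucleotide can be illustrated with a minimal sketch. The text does not specify any particular code; the two-bits-per-base mapping below (A=00, C=01, G=10, T=11) and the function names `encode` and `decode` are assumptions chosen only for illustration.

```python
# Hypothetical 2-bits-per-base code; the specification does not prescribe one.
BASES = "ACGT"  # A=00, C=01, G=10, T=11 (assumed mapping)


def encode(data: bytes) -> str:
    """Map each byte to four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def decode(seq: str) -> bytes:
    """Invert encode(); the sequence length must be a multiple of 4."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


tag = encode(b"LOT42")  # a 20-base marker for a hypothetical lot label
```

Under this assumed code, a five-character lot label becomes a 20-nucleotide marker, which could be amplified (e.g., by PCR) and read back by sequencing before decoding.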

Similar sorts of encoding may be provided by fingerprinting-type analysis. Whether the resolution is absolute or
less so, the concept of coding information on molecules such as nucleic acids, which can be amplified and later
decoded, may be a very useful and important application.

This technology also provides the ability to include markers for origins of biological materials. For example, a
patented animal line may be transformed with a particular unnatural sequence which can be traced back to its origin. With
a selection of multiple markers, the likelihood could be negligible that a combination of markers would have independ- 
ently arisen from a source other than the patented or specifically protected source. This technique may provide a 
means for tracing the actual origin of particular biological materials. Bacteria, plants, and animals will be subject to 
marking by such encoding sequences. 

B. Fingerprinting

As indicated above, fingerprinting technology may also be used for data encryption. Moreover, fingerprinting allows
for significant identification of particular individuals. Where the fingerprinting technology is standardized, and used for
identification of large numbers of people, related equipment and peripheral processing will be developed to accompany
the underlying technology. For example, specific equipment may be developed for automatically taking a biological
sample and generating or amplifying the information molecules within the sample to be used in fingerprinting analysis.
Moreover, the fingerprinting substrate may be mass produced using particular types of automatic equipment. Synthetic
equipment may produce the entire matrix simultaneously by stepwise synthetic methods as provided by the VLSIPS
technology. The attachment of specific probes onto a substrate may also be automated, e.g., making use of the caged
biotin technology. See, e.g., U.S.S.N. 07/512,671 (caged biotin CIP).

In addition, peripheral processing may be important and may be dedicated to this specific application. Thus,
automated equipment for producing the substrates may be designed, or particular systems which take in a biological sample
and output either a computer readout or an encoded instrument, e.g., a card or document which indicates the
information and can provide that information to others. An identification having a short magnetic strip with a few million bits may
be used to provide individual identification and important medical information useful in a medical emergency.

In fact, data banks may be set up to correlate all of this fingerprinting information with medical information. This
may allow for the determination of correlations between various medical problems and specific DNA sequences. By
collating large populations of medical records with genetic information, genetic propensities and genetic susceptibilities to
particular medical conditions may be identified. Moreover, with standardization of substrates, the micro encoding data
may also be standardized to reproduce the information from a centralized data bank or on an encoding device carried
on an individual person. On the other hand, if the fingerprinting procedure is sufficiently quick and routine, every
hospital may routinely perform a fingerprinting operation and from that determine many important medical parameters for an
individual.

In particular industries, the VLSIPS sequencing, fingerprinting, or mapping technology will be particularly
appropriate. As mentioned above, agricultural livestock suppliers may be able to encode and determine whether their particular
strains are being used by others. By incorporating particular markers into their genetic stocks, the markers will indicate
origin of genetic material. This is applicable to seed producers, livestock producers, and other suppliers of medical or
agricultural biological materials.

This may also be useful in identifying individual animals or plants. For example, these markers may be useful in
determining whether certain fish return to their original breeding grounds, whether sea turtles always return to their
original birthplaces, or to determine the migration patterns and viability of populations of particular endangered species.
It would also provide means for tracking the sources of particular animal products. For example, it might be useful for
determining the origins of controlled animal substances such as elephant ivory or particular bird populations whose
importation or exportation is controlled.

As indicated above, polymers may be used to encode important information on source and batch and supplier. This
is described in greater detail, e.g., in "Applications of PCR to industrial problems," (1990) in Chemical and Engineering
News 68:145, which is hereby incorporated herein by reference. In fact, the synthetic method can be applied to the
storage of enormous amounts of information. Small substrates may encode enormous amounts of information, and its
recovery will make use of the inherent replication capacity. For example, with regions of 10 μm x 10 μm, 1 cm² has 10⁶
regions. In theory, the entire human genome could be attached in 1000 nucleotide segments on a 3 cm² surface.
Genomes of endangered species may be stored on these substrates.
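
The storage-density arithmetic above can be checked with a short sketch. The human genome size of about 3 x 10⁹ base pairs is an assumption supplied here (it is not stated in this passage), and the function names are illustrative only.

```python
CM_IN_UM = 10_000  # 1 cm = 10^4 micrometers


def regions_per_cm2(side_um: int) -> int:
    """Number of square regions of the given side length in 1 cm^2."""
    per_side = CM_IN_UM // side_um
    return per_side * per_side


def area_needed_cm2(genome_bp: int, segment_len: int, side_um: int) -> float:
    """Area required if each region holds one segment of segment_len bases."""
    segments = genome_bp // segment_len
    return segments / regions_per_cm2(side_um)


HUMAN_GENOME_BP = 3_000_000_000  # assumed approximate haploid genome size

regions = regions_per_cm2(10)                        # 10 um x 10 um regions
area = area_needed_cm2(HUMAN_GENOME_BP, 1000, 10)    # 1000-nucleotide segments
```

With these assumptions, 10 μm regions give 10⁶ regions per cm², and 3 x 10⁶ segments of 1000 nucleotides occupy 3 cm², in agreement with the figures in the text.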

Fingerprinting may also be used for genetic tracing or for identifying individuals for forensic science purposes. See,
e.g., Morris, J. et al. (1989) "Biostatistical Evaluation of Evidence From Continuous Allele Frequency Distribution DNA
Probes in Reference to Disputed Paternity and Identity," J. Forensic Science 34:1311-1317, and references provided
therein; each of which is hereby incorporated herein by reference.

In addition, high resolution fingerprinting allows particular samples to be distinguished to high resolution.
As indicated above, new cell classifications may be defined based on combinations of a large number of properties.
Similar applications will be found in distinguishing different species of animals or plants. In fact, microbial identification
may become dependent on characterization of the genetic content. Tumors or other cells exhibiting abnormal physiology
will be detectable by use of the present invention. Also, knowing the genetic fingerprint of a microorganism may provide
very useful information on how to treat an infection by such organism.

Modifications of the fingerprint embodiments may be used to diagnose the condition of the organism. For example,
a blood sample is presently used for diagnosing any of a number of different physiological conditions. A
multi-dimensional fingerprinting method made available by the present invention could become a routine means for diagnosing an
enormous number of physiological features simultaneously. This may revolutionize the practice of medicine in providing
information on an enormous number of parameters together at one time. In another way, the determination of genetic
predisposition may also revolutionize the practice of medicine, providing a physician with the ability to predict the likelihood of particular
medical conditions arising at any particular moment. It also provides the ability to apply preventative medicine.

Also available are kits with the reagents useful for performing sequencing, fingerprinting, and mapping procedures.
The kits will have various compartments with the necessary reagents, e.g., substrate, labeling reagents for
target samples, buffers, and other useful accompanying products.

C. Mapping

The present invention also provides the means for mapping sequences within enormous stretches of sequence.
For example, nucleotide sequences may be mapped within enormous chromosome size sequence maps. For example,
it would be possible to map a chromosomal location within a chromosome which contains hundreds of millions of
nucleotide base pairs. In addition, the mapping and fingerprinting embodiments allow for testing of chromosomal
translocations, one of the standard problems for which amniocentesis is performed.

The present invention will be better understood by reference to the following illustrative examples. The following
examples are offered by way of illustration and not by way of limitation.

Relevant applications whose techniques are incorporated herein by reference are PCT publication no.
WO90/15070, published December 13, 1990; PCT publication no. WO91/07087, published May 30, 1991; U.S.S.N.
07/624,120, filed December 6, 1990; and U.S.S.N. 07/626,730, filed December 6, 1990.

Also, additional relevant techniques are described, e.g., in Sambrook, J., et al. (1989) Molecular Cloning: A
Laboratory Manual, 2d Ed., vols 1-3, Cold Spring Harbor Press, New York; Greenstein and Winitz (1961) Chemistry of the
Amino Acids, Wiley and Sons, New York; Bodanszky, M. (1988) Peptide Chemistry: A Practical Textbook,
Springer-Verlag, New York; Harlow and Lane (1988) Antibodies: A Laboratory Manual, Cold Spring Harbor Press, New York; Glover,
D. (ed.) (1987) DNA Cloning: A Practical Approach, vols 1-3, IRL Press, Oxford; Bishop and Rawlings (1987) Nucleic
Acid and Protein Sequence Analysis: A Practical Approach, IRL Press, Oxford; Hames and Higgins (1985) Nucleic Acid
Hybridisation: A Practical Approach, IRL Press, Oxford; Wu et al. (1989) Recombinant DNA Methodology, Academic
Press, San Diego; Goding (1986) Monoclonal Antibodies: Principles and Practice (2d ed.), Academic Press, San
Diego; Finegold and Barron (1986) Bailey and Scott's Diagnostic Microbiology (7th ed.), Mosby Co., St. Louis; Collins
et al. (1989) Microbiological Methods (6th ed.), Butterworth, London; Chaplin and Kennedy (1986) Carbohydrate
Analysis: A Practical Approach, IRL Press, Oxford; Van Dyke (ed.) (1985) Bioluminescence and Chemiluminescence:
Instruments and Applications, vol 1, CRC Press, Boca Raton; and Ausubel et al. (ed.) (1990) Current Protocols in
Molecular Biology, Greene Publishing and Wiley-Interscience, New York; each of which is hereby incorporated herein
by reference.

EXAMPLES 

The following examples are provided to illustrate the efficacy of the inventions herein. All operations were
conducted at about ambient temperatures and pressures unless indicated to the contrary.

POLYNUCLEOTIDE SEQUENCING 

1. HPLC of the photolysis of 5'-O-nitroveratryl-thymidine.

In order to determine the time for photolysis of 5'-O-nitroveratryl thymidine to thymidine, a 100 μM solution of
NVO-Thym-OH (5'-O-nitroveratryl thymidine) in dioxane was made and ~200 μl aliquots were irradiated (in a quartz cuvette 1
cm x 2 mm) at 362.3 nm for 20 sec, 40 sec, 60 sec, 2 min, 5 min, 10 min, 15 min, and 20 min. The resulting irradiated
mixtures were then analyzed by HPLC using a Varian MicroPak SP column (C18 analytical) at a flow rate of 1 ml/min
and a solvent system of 40% CH3CN and 60% water. Thymidine has a retention time of 15 min and NVO-Thym-OH
has a retention time of 2.1 min. It was seen that after 10 min of exposure the deprotection was complete.

2. Preparation and Detection of Thymidine-Cytidine dimer (FITC)

The reaction is illustrated: [reaction scheme not reproduced]

To an aminopropylated glass slide (standard VLSIPS) was added a mixture of the following:

12.2 mg of NVO-Thym-CO2H (IX)
3.4 mg of HOBT (N-hydroxybenztriazole)
8.8 μl DIEA (diisopropylethylamine)
11.1 mg BOP reagent
2.5 ml DMF

After 2 h coupling time (standard VLSIPS) the plate was washed, acetylated with acetic anhydride/pyridine,
washed, dried, and photolyzed in dioxane at 362 nm at 14 mW/cm² for 10 min using a 500 μm checkerboard mask. The
slide was then taken and treated with a mixture of the following:

107 mg of FMOC-amine modified C (III) 

21 mg of tetrazole

1 ml anhydrous CH3CN 

After being treated for approximately 8 min, the slide was washed off with CH3CN, dried, and oxidized with
I2/H2O/THF/lutidine for 1 min. The slide was again washed, dried, and treated for 30 min with a 20% solution of DBU in
DMF. After thorough rinsing of the slide, it was next exposed to a FITC solution (1 mM fluorescein isothiocyanate [FITC]
in DMF) for 50 min, then washed, dried, and examined by fluorescence microscopy. This reaction is illustrated:
[reaction scheme not reproduced]

3. Preparation and Detection of Thymidine-Cytidine dimer (Biotin)

An aminopropyl glass slide was soaked in a solution of ethylene oxide (20% in DMF) to generate a hydroxylated
surface. To the slide was added a mixture of the following:

32 mg of NVO-T-OCED (X)
11 mg of tetrazole
0.5 ml of anhydrous CH3CN

After 8 min the plate was rinsed with acetonitrile, then oxidized with I2/H2O/THF/lutidine for 1 min, washed and
dried. The slide was then exposed to a 1:3 mixture of acetic anhydride:pyridine for 1 h, then washed and dried. The
substrate was then photolyzed in dioxane at 362 nm at 14 mW/cm² for 10 min using a 500 μm checkerboard mask, dried,
and then treated with a mixture of the following:

65 mg of biotin modified C (IV)
11 mg of tetrazole
0.5 ml anhydrous CH3CN

After 8 min the slide was washed with CH3CN, then oxidized with I2/H2O/THF/lutidine for 1 min, washed, and then
dried. The slide was then soaked for 30 min in a PBS/0.05% Tween 20 buffer and the solution then shaken off. The slide
was next treated with FITC-labeled streptavidin at 10 ng/ml in the same buffer system for 30 min. After this time the
streptavidin-buffer system was rinsed off with fresh PBS/0.05% Tween 20 buffer and then the slide was finally agitated
in distilled water for about 1/2 h. After drying, the slide was examined by fluorescence microscopy (see Fig. 2 and Fig.
3).

4. substrate preparation 

Before attachment of reactive groups it is preferred to clean the substrate which is, in a preferred embodiment, a
glass substrate such as a microscope slide or cover slip. A roughened surface will be useable but a plastic or other solid
substrate is also appropriate. According to one embodiment the slide is soaked in an alkaline bath consisting of, e.g., 1
liter of 95% ethanol with 120 ml of water and 120 grams of sodium hydroxide for 12 hours. The slides are washed with
a buffer and under running water, allowed to air dry, and rinsed with a solution of 95% ethanol.

The slides are then aminated with, e.g., aminopropyltriethoxysilane for the purpose of attaching amino groups to
the glass surface on linker molecules, although other omega functionalized silanes could also be used for this purpose.
In one embodiment 0.1% aminopropyltriethoxysilane is utilized, although solutions with concentrations from 10^-7% to
10% may be used, with about 10^-3% to 2% preferred. A 0.1% mixture is prepared by adding to 100 ml of a 95%
ethanol/5% water mixture, 100 microliters (μl) of aminopropyltriethoxysilane. The mixture is agitated at about ambient
temperature on a rotary shaker for an appropriate amount of time, e.g., about 5 minutes. 500 μl of this mixture is then
applied to the surface of one side of each cleaned slide. After 4 minutes or more, the slides are decanted of this solution
and thoroughly rinsed three times or more by dipping in 100% ethanol.

After the slides dry, they are heated in a 110-120°C vacuum oven for about 20 minutes, and then allowed to cure
at room temperature for about 12 hours in an argon environment. The slides are then dipped into DMF
(dimethylformamide) solution, followed by a thorough washing with methylene chloride.

5. linker attachment, blocking of free sites

The aminated surface of the slide is then exposed to about 500 μl of, for example, a 30 millimolar (mM) solution of
NVOC-nucleotide-NHS (N-hydroxysuccinimide) in DMF for attachment of a NVOC-nucleotide to each of the amino
groups. See, e.g., SIGMA Chemical Company for various nucleotide derivatives. The surface is washed with, for
example, DMF, methylene chloride, and ethanol.

Any unreacted aminopropyl silane on the surface, i.e., those amino groups which have not had the
NVOC-nucleotide attached, is now capped with acetyl groups (to prevent further reaction) by exposure to a 1:3 mixture of acetic
anhydride in pyridine for 1 hour. Other materials which may perform this residual capping function include trifluoroacetic
anhydride, formic acetic anhydride, or other reactive acylating agents. Finally, the slides are washed again with DMF,
methylene chloride, and ethanol.

6. synthesis of eight trimers of C and T 

Fig. 4 illustrates a possible synthesis of the eight trimers of the two-monomer set: cytosine and thymine
(represented by C and T, respectively). A glass slide bearing silane groups terminating in 6-nitroveratryloxycarboxamide
(NVOC-NH) residues is prepared as a substrate. Active esters (pentafluorophenyl, OBt, etc.) of cytosine and thymine
protected at the 5' hydroxyl group with NVOC are prepared as reagents. While not pertinent to this example, if side
chain protecting groups are required for the monomer set, these must not be photoreactive at the wavelength of light
used to protect the primary chain.

For a monomer set of size n, n x l cycles are required to synthesize all possible sequences of length l. A cycle
consists of:

1. Irradiation through an appropriate mask to expose the 5'-OH groups at the sites where the next residue is to be
added, with appropriate washes to remove the by-products of the deprotection.
2. Addition of a single activated and protected (with the same photochemically-removable group) monomer, which
will react only at the sites addressed in step 1, with appropriate washes to remove the excess reagent from the
surface.

The above cycle is repeated for each member of the monomer set until each location on the surface has been
extended by one residue in one embodiment. In other embodiments, several residues are sequentially added at one
location before moving on to the next location. Cycle times will generally be limited by the coupling reaction rate, now
as short as about 10 min in automated oligonucleotide synthesizers. This step is optionally followed by addition of a
protecting group to stabilize the array for later testing. For some types of polymers (e.g., peptides), a final deprotection of
the entire surface (removal of photoprotective side chain groups) may be required.

More particularly, as shown in Fig. 4A, the glass 20 is provided with regions 22, 24, 26, 28, 30, 32, 34, and 36.
Regions 30, 32, 34, and 36 are masked, indicated by the hatched regions, as shown in Fig. 4B, and the glass is
irradiated at the bright regions 22, 24, 26, and 28, and exposed to a reagent containing a photosensitive blocked C (e.g.,
cytosine derivative), with the resulting structure shown in Fig. 4C. The substrate is carefully washed and the reactants
removed. Thereafter, regions 22, 24, 26, and 28 are masked, as indicated by the hatched region, the glass is irradiated
(as shown in Fig. 4D), as indicated by the bright regions, at 30, 32, 34, and 36, and exposed to a photosensitive blocked
reagent containing T (e.g., thymine derivative), with the resulting structure shown in Fig. 4E. The process proceeds,
consecutively masking and exposing the sections as shown until the structure shown in Fig. 4M is obtained. The glass
is irradiated and the terminal groups are, optionally, capped by acetylation. As shown, all possible trimers of
cytosine/thymine are obtained.
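
The masking scheme just described can be sketched as a small simulation. This is an illustrative model only: sites are numbered 0 to 7 rather than by the reference numerals of Fig. 4, the mask for each cycle is represented as a rule on the site index, and the function name `synthesize` is hypothetical.

```python
def synthesize(monomers: str, length: int):
    """Simulate masked, layer-by-layer synthesis on n**length sites.

    In each layer, one cycle is run per monomer; the mask for that cycle
    exposes exactly the sites that should receive that monomer.
    Returns the sequence at each site and the number of cycles used.
    """
    n = len(monomers)
    sites = [""] * (n ** length)
    cycles = 0
    for layer in range(length):
        block = n ** (length - 1 - layer)  # run length of one exposed stripe
        for j, monomer in enumerate(monomers):
            # Mask: expose sites whose base-n digit for this layer equals j.
            for site in range(len(sites)):
                if (site // block) % n == j:
                    sites[site] += monomer  # coupling at deprotected sites
            cycles += 1
    return sites, cycles


trimers, cycles_used = synthesize("CT", 3)
```

In this model the first cycle adds C to sites 0 to 3 and the second adds T to sites 4 to 7, mirroring the first two mask steps described for Fig. 4; after 2 x 3 = 6 cycles each of the eight sites carries a distinct trimer.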

In this example, no side chain protective group removal is necessary, as might be common in modified nucleotides.
If it is desired, side chain deprotection may be accomplished by treatment with ethanedithiol and trifluoroacetic acid.

In general, the number of steps needed to obtain a particular polymer chain is defined by:

$$n \times l \tag{1}$$

where:

n = the number of monomers in the basis set of monomers, and
l = the number of monomer units in a polymer chain.

Conversely, the synthesized number of sequences of length l will be:

$$n^{l}. \tag{2}$$

Of course, greater diversity is obtained by using masking strategies which will also include the synthesis of
polymers having a length of less than l. If, in the extreme case, all polymers having a length less than or equal to l are
synthesized, the number of polymers synthesized will be:

$$n^{l} + n^{l-1} + \cdots + n^{1}. \tag{3}$$
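
The three counts given in expressions (1) to (3) can be evaluated with a brief sketch; the function names are illustrative and not part of the specification.

```python
def synthesis_steps(n: int, l: int) -> int:
    """Expression (1): cycles needed for chains of length l from n monomers."""
    return n * l


def full_length_sequences(n: int, l: int) -> int:
    """Expression (2): number of distinct sequences of length exactly l."""
    return n ** l


def all_lengths_up_to(n: int, l: int) -> int:
    """Expression (3): number of sequences of every length from 1 to l."""
    return sum(n ** k for k in range(1, l + 1))


# Two-monomer trimers (C, T) and four-monomer 10-mers (A, C, G, T).
trimer_counts = (synthesis_steps(2, 3), full_length_sequences(2, 3),
                 all_lengths_up_to(2, 3))
decamer_counts = (synthesis_steps(4, 10), full_length_sequences(4, 10),
                  all_lengths_up_to(4, 10))
```

For the C/T trimers this gives 6 steps, 8 trimers, and 14 polymers of length 1 to 3; for 10-mers of the four nucleotides it gives 40 steps and 1,048,576 full-length sequences.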



The maximum number of lithographic steps needed will generally be n for each "layer" of monomers, i.e., the total
number of masks (and, therefore, the number of lithographic steps) needed will be n x l. The size of the transparent
mask regions will vary in accordance with the area of the substrate available for synthesis and the number of sequences
to be formed. In general, the size of the synthesis areas will be:

size of synthesis areas = (A)/(S) 

where:

A is the total area available for synthesis; and 
S is the number of sequences desired in the area. 

It will be appreciated by those of skill in the art that the above method could readily be used to simultaneously
produce thousands or millions of oligomers on a substrate using the photolithographic techniques disclosed herein.
Consequently, the method results in the ability to practically test large numbers of, for example, di, tri, tetra, penta, hexa,
hepta, octa, nona, deca, even dodecanucleotides, or larger polynucleotides.

The above example has illustrated the method by way of a manual example. It will of course be appreciated that
automated or semi-automated methods could be used. The substrate would be mounted in a flow cell for automated
addition and removal of reagents, to minimize the volume of reagents needed, and to more carefully control reaction
conditions. Successive masks will be applicable manually or automatically. See, e.g., PCT publication no. WO90/15070
and U.S.S.N. 07/624,120.

7. labeling of target

The target oligonucleotide can be labeled using standard procedures referred to above. As discussed, for certain
situations, a reagent which recognizes interaction, e.g., ethidium bromide, may be provided in the detection step.
Alternatively, fluorescence labeling techniques may be applied, see, e.g., Smith, et al. (1986) Nature 321:674-679; and
Prober, et al. (1987) Science 238:336-341. The techniques described therein will be followed with minimal
modifications as appropriate for the label selected.

8. dimers of A, C, G, and T

The described technique may be applied, with photosensitive blocked nucleotides corresponding to adenine,
cytosine, guanine, and thymine, to make combinations of polynucleotides consisting of each of the four different
nucleotides. All 16 possible dimers would be made using a minor modification of the described method.

9. 10-mers of A, C, G, and T

The described technique for making dimers of A, C, G, and T may be further extended to make longer
oligonucleotides. The automated system described, e.g., in PCT publication no. WO90/15070, and U.S.S.N. 07/624,120, can be
adapted to make all possible 10-mers composed of the 4 nucleotides A, C, G, and T. The photosensitive, blocked
nucleotide analogues have been described above, and would be readily adaptable to longer oligonucleotides.

10. specific recognition hybridization to 10-mers 


The described hybridization conditions are directly applicable to the sequence specific recognition reagents
attached to the substrate, produced as described immediately above. The 10-mers have an inherent property of
hybridizing to a complementary sequence. For optimum discrimination between full-matching and some mismatch, the
conditions of hybridization should be carefully selected, as described above. Careful control of the conditions, and titration
of parameters, should be performed to determine the optimum collective conditions.

11. hybridization

Hybridization conditions are described in detail, e.g., in Hames and Higgins (1985) Nucleic Acid Hybridisation: A
Practical Approach; and the considerations for selecting particular conditions are described, e.g., in Wetmur and
Davidson (1968) J. Mol. Biol. 31:349-370, and Wood et al. (1985) Proc. Natl. Acad. Sci. USA 82:1585-1588. As described
above, conditions are desired which can distinguish matching along the entire length of the probe from cases where there is
one or more mismatched bases. The length of incubation and conditions will be similar, in many respects, to the
hybridization conditions used in Southern blot transfers. Typically, the GC bias may be minimized by the introduction of
appropriate concentrations of the alkylammonium buffers, as described above.

Titration of the temperature and other parameters is desired to determine the optimum conditions for specificity and
distinguishability of absolutely matched hybridization from mismatched hybridization.

A fluorescently labeled target or set of targets is generated, as described in Prober, et al. (1987) Science
238:336-341, or Smith, et al. (1986) Nature 321:674-679. Preferably, the target or targets are of the same length as, or
slightly longer than, the oligonucleotide probes attached to the substrate, and they will have known sequences. Thus,
only a few of the probes hybridize perfectly with the target, and which particular ones did would be known.
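
Why only a few probes match a known target perfectly can be illustrated with a short sketch: each window of probe length along the target pairs with exactly one probe, its reverse complement. The helper names below are illustrative, and the example target sequence is arbitrary.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the strand that base-pairs with seq, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def perfect_match_probes(target: str, k: int) -> set:
    """Probes of length k that are fully complementary to some window of target."""
    return {
        reverse_complement(target[i:i + k])
        for i in range(len(target) - k + 1)
    }


# An arbitrary 12-nucleotide target against an array of all 10-mers.
expected = perfect_match_probes("ACGTACGGTCAT", 10)
```

A 12-nucleotide target has three 10-nucleotide windows, so at most three of the 4^10 probes on a complete 10-mer array are expected to hybridize perfectly, and their identities are known in advance.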

The substrate and probes are incubated under appropriate conditions for a sufficient period of time to allow
hybridization to completion. The time is measured to determine when the probe-target hybridizations have reached
completion. A salt buffer which minimizes GC bias is preferred, incorporating, e.g., a buffer such as tetramethyl ammonium or
tetraethyl ammonium ion at between about 2.4 and 3.0 M. See Wood, et al. (1985) Proc. Natl. Acad. Sci. USA
82:1585-1588. This time is typically at least about 30 min, and may be as long as about 1-5 days. Typically very long matches
will hybridize more quickly, very short matches will hybridize less quickly, depending upon relative target and probe
concentrations. The hybridization will be performed under conditions where the reagents are stable for that time duration.

Upon maximal hybridization, the conditions for washing are titrated. Three parameters initially titrated are time,
temperature, and cation concentration of the wash step. The matrix is scanned at various times to determine the
conditions at which the distinguishability between true perfect hybrid and mismatched hybrid is optimized. These conditions
will be preferred in the sequencing embodiments.
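
Selecting the wash condition from a titration series can be sketched as follows. The scoring rule (ratio of perfect-match signal to mismatch signal), the `Scan` record, the thresholds, and all numeric values are assumptions made for illustration; the text does not prescribe a particular figure of merit, and the data are not experimental results.

```python
from typing import NamedTuple


class Scan(NamedTuple):
    minutes: float          # wash time
    temp_c: float           # wash temperature
    cation_molar: float     # cation concentration of the wash
    perfect_signal: float   # signal at a known perfect-match site
    mismatch_signal: float  # signal at a known mismatched site


def best_condition(scans, min_signal=0.0, noise_floor=1.0):
    """Pick the scan with the highest perfect/mismatch signal ratio.

    Scans whose perfect-match signal falls below min_signal are ignored,
    and mismatch signals are clamped to noise_floor to avoid dividing by
    values that are indistinguishable from background.
    """
    usable = [s for s in scans if s.perfect_signal >= min_signal]
    if not usable:
        raise ValueError("no scan meets the minimum perfect-match signal")
    return max(
        usable,
        key=lambda s: s.perfect_signal / max(s.mismatch_signal, noise_floor),
    )


# Made-up illustrative readings, not measured data.
scans = [
    Scan(5, 25.0, 3.0, 1000.0, 800.0),
    Scan(15, 37.0, 3.0, 700.0, 100.0),
    Scan(30, 45.0, 3.0, 50.0, 10.0),
]
chosen = best_condition(scans, min_signal=100.0)
```

Here the middle condition is selected: it retains a strong perfect-match signal while removing most of the mismatched hybrid, whereas the harshest wash is excluded because too little perfect-match signal survives.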

12. positional detection of specific interaction 


As indicated above, the detection of specific interactions may be performed by detecting the positions where the
labeled target sequences are attached. Where the label is a fluorescent label, the apparatus described, e.g., in PCT
publication no. WO90/15070 and U.S.S.N. 07/624,120 may be advantageously applied. In particular, the synthetic
processes described above will result in a matrix pattern of specific sequences attached to the substrate, and a known
pattern of interactions can be converted to corresponding sequences.
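
The conversion of a pattern of interactions into sequences can be sketched as a simple lookup. This is an assumption-laden illustration: the layout is taken to be a mapping from (row, column) positions to the probe sequence synthesized there, and a position counts as "interacting" when its scanned intensity meets a threshold. The names `layout`, `intensities`, and `sequences_at_signals` are hypothetical.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (row, column) on the substrate


def sequences_at_signals(layout: Dict[Position, str],
                         intensities: Dict[Position, float],
                         threshold: float) -> List[str]:
    """Return the probe sequences located at positions whose signal meets the threshold."""
    hits = [pos for pos, value in intensities.items()
            if value >= threshold and pos in layout]
    return sorted(layout[pos] for pos in hits)


# Invented 2 x 2 example layout and scan
example_layout = {(0, 0): "ACGT", (0, 1): "CGTA", (1, 0): "GTAC", (1, 1): "TACG"}
example_scan = {(0, 0): 950.0, (0, 1): 20.0, (1, 0): 870.0, (1, 1): 15.0}
print(sequences_at_signals(example_layout, example_scan, threshold=500.0))
```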

In an alternative embodiment, a separate reagent which differentially interacts with the probe and interacted
probe/targets can indicate where interaction occurs or does not occur. A single-strand specific reagent will indicate
where no interaction has taken place, while a double-strand specific reagent will indicate where interaction has taken
place. An intercalating dye, e.g., ethidium bromide, may be used to indicate the positions of specific interaction.

13. analysis

Conversion of the positional data into sequence specificity will provide the set of subsequences whose analysis by
overlap segments may be performed, as described above. Analysis is provided by the methodology described above,
or using, e.g., software available from the Genetic Engineering Center, P.O. Box 794, 11000 Belgrade, Yugoslavia
(Yugoslav group). See also Macevicz, PCT publication no. WO 90/04652, which is hereby incorporated herein by
reference.
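
As a minimal sketch of analysis by overlap segments, the following assumes an idealized case: every subsequence has the same length k, each occurs once in the target, and each (k-1)-base overlap is unique, so that the chain of overlaps is unambiguous. It is not the software referred to above; the function name `assemble` is hypothetical, and real data with repeats or missing subsequences would need a more general method.

```python
from typing import Dict, Iterable, List


def assemble(subsequences: Iterable[str]) -> str:
    """Rebuild a target by chaining subsequences that overlap by k-1 bases."""
    kmers = set(subsequences)
    if not kmers:
        raise ValueError("no subsequences supplied")
    by_prefix: Dict[str, List[str]] = {}
    suffixes = set()
    for s in kmers:
        by_prefix.setdefault(s[:-1], []).append(s)
        suffixes.add(s[1:])
    # The first subsequence is the one whose prefix is not the suffix of any other.
    starts = [s for s in kmers if s[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot find a unique starting subsequence")
    current = starts[0]
    used = {current}
    sequence = current
    while True:
        candidates = [s for s in by_prefix.get(current[1:], []) if s not in used]
        if not candidates:
            break
        if len(candidates) > 1:
            raise ValueError("ambiguous overlap; more information is needed")
        current = candidates[0]
        used.add(current)
        sequence += current[-1]  # each step extends the sequence by one base
    if len(used) != len(kmers):
        raise ValueError("some subsequences could not be placed")
    return sequence


print(assemble(["TGCT", "ATGC", "CTTA", "GCTT", "TAGC", "TTAG"]))
```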

The description of the preparation of short peptides on a substrate incorporates by reference sections in U.S.S.N.
07/492,462 (VLSIPS CIP), and is described below.

POLYNUCLEOTIDE FINGERPRINTING 

The above section on generation of reagents for sequencing provides specific reagents useful for fingerprinting
applications. Fingerprinting embodiments may be applied towards polynucleotide fingerprinting, cell and tissue
classification, cell and tissue temporal development stage classification, diagnostic tests, forensic uses for individual
identification, classification of organisms, and genetic screening of individuals. Mapping applications are also described
below.

Polynucleotide fingerprinting may use reagents similar to those described above for probing a sequence for the
presence of specific subsequences found therein. Typically, the subsequences used for fingerprinting will be longer
than the sequences used in oligonucleotide sequencing. In particular, specific long segments may be used to determine
the similarity of different samples of nucleic acids. They may also be used to fingerprint whether specific combinations
of information are provided therein. Particular probe sequences are selected and attached in a positional manner to a
substrate. The means for attachment may be either using a caged biotin method described, e.g., in U.S.S.N.
07/612,671 (caged biotin CIP), or by another method using targeting molecules. In one embodiment, an unnatural
nucleotide or similar complementary binding molecule may be attached to the fingerprinting probe and the probe
thereby directed towards complementary sequences on a VLSIPS substrate. Typically, unnatural nucleotides would be
preferred, e.g., unnatural optical isomers, which would not interfere with natural nucleotide interactions.

Having produced a substrate with particular fingerprint probes attached thereto at positionally defined regions, the
substrate may be used in a manner quite similar to the sequencing embodiment to provide information as to whether
the fingerprint probes are detecting the corresponding sequence in a target sequence. This will often provide
information similar to a Southern blot hybridization.

Temporal Development

Developmental RNA expression patterns 

The present fingerprinting invention also allows cell classification by identification of developmental RNA
expression patterns. For example, a lymphocyte stem cell expresses a particular combination of RNA species. As the
lymphocyte develops through a programmed developmental scheme, at various stages it expresses particular RNA species
which are diagnostic of particular stages in development. Again, the fingerprinting methodology allows for the definition
of specific structural features which are diagnostic of developmental or functional features which will allow classification
of cells into temporal developmental classes. Cells, products of those cells, or lysates of those cells will be assayed to
determine the developmental stage of the source cells. In this manner, once a developmental stage is defined, specific
synchronized populations of cells will be selected out of another population. These synchronized populations may be
very important in determining the biological mechanisms of development.

The present invention also allows for fingerprinting of the mRNA population of a cell. In this fashion, the mRNA
population, which should be a good determinant of developmental stage, will be correlated with other structural features of
the cell. In this manner, cells at specific developmental stages will be characterized by the intracellular environment as
well as the extracellular environment.

Diagnostic Tests 

The present invention also provides the ability to perform diagnostic tests. Diagnostic tests typically are based upon
a fingerprint type assay, which tests for the presence of specific diagnostic polynucleotides. Thus, the present invention
provides means for viral strain identification, bacterial strain identification, and other diagnostic tests using positionally
defined specific oligonucleotide reagents.

Viral Identification

The present invention provides reagents and methodology for identifying viral strains. The viral genome may be
probed for specific sequences which are characteristic of particular viral strains. Specific hybridization patterns on a
VLSIPS oligonucleotide substrate can identify the presence of particular viral genomes.
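
One way to picture strain identification from a hybridization pattern is to compare the set of positive probe positions against stored reference patterns. The sketch below is an illustration only: it assumes each pattern is reduced to a set of positive probe identifiers and scores agreement with the Jaccard index, a choice made here for simplicity and not taken from the description. The strain names, probe identifiers, and the `min_score` cutoff are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Shared positives divided by all positives seen in either pattern."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def identify(observed: Set[int],
             references: Dict[str, Set[int]],
             min_score: float = 0.8) -> Tuple[Optional[str], float]:
    """Return the best-matching reference name and its score.

    The name is None when no reference reaches min_score.
    """
    best_name: Optional[str] = None
    best_score = 0.0
    for name in sorted(references):  # sorted for a deterministic result on ties
        score = jaccard(observed, references[name])
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score
    return best_name, best_score


# Invented reference patterns: strain -> identifiers of probes expected to hybridize
example_references = {"strain_A": {1, 2, 3, 4}, "strain_B": {3, 4, 5, 6}}
print(identify({1, 2, 3, 4}, example_references))
```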

Bacterial Identification 

Similar techniques will be applicable to identifying a bacterial source. This may be useful in diagnosing bacterial
infections, or in classifying sources of particular bacterial species. For example, the bacterial assay may be useful in
determining the natural range of survivability of particular strains of bacteria across regions of the country or in different
ecological niches.

Other Microbiological Identifications

The present invention not only provides means for diagnosis of other microbiological and other species, e.g.,
protozoal species and parasitic species, in a biological sample, but also provides the means for assaying a combination
of different infections. For example, a biological specimen may be assayed for the presence of any or all of these
microbiological species. In human diagnostic uses, typical samples will be blood, sputum, stool, urine, or other samples.

Individual Identification 

The present invention provides the ability to fingerprint and identify a genetic individual. This individual may be a
bacterium or lower microorganism, as described above in diagnostic tests, or a plant or animal. An individual may be
identified genetically, as described.

Genetic fingerprinting has been utilized in comparing different related species in Southern hybridization blots.
Genetic fingerprinting has also been used in forensic studies; see, e.g., Morris et al. (1989) J. Forensic Science
34:1311-1317, and references cited therein. As described above, an individual may be identified genetically by a
sufficiently large number of probes. The likelihood that another individual would have an identical pattern over a
sufficiently large number of probes may be statistically negligible. However, it is often quite important that a large
number of probes be used where the statistical probability of matching is desired to be particularly low. In fact, the
probes will optimally be selected for having high heterogeneity among the population. In addition, the fingerprint method
may make use of the pattern of homologies indicated by a series of more and more stringent washes. Then, each position
has both a sequence specificity and a homology measurement, the combination of which greatly increases the number of
dimensions and correspondingly reduces the statistical likelihood of a coincidental perfect pattern match with another
genetic individual.
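
The dependence of the match probability on the number of probes can be made concrete with a short sketch. It rests on a simplifying assumption that is not stated in the text: that probes behave independently, and that for each probe there is a known probability that two unrelated individuals give the same result. Under that assumption the chance of an identical overall pattern is the product of the per-probe values. The function names and the example figures are hypothetical.

```python
import math
from typing import Sequence


def random_match_probability(per_probe_match: Sequence[float]) -> float:
    """Probability that two unrelated individuals agree at every probe,
    assuming the probes are statistically independent."""
    if any(not 0.0 <= p <= 1.0 for p in per_probe_match):
        raise ValueError("each per-probe probability must lie in [0, 1]")
    return math.prod(per_probe_match)


def probes_needed(per_probe_match: float, target: float) -> int:
    """Smallest number of equally informative probes that brings the
    random match probability down to the target or below."""
    if not 0.0 < per_probe_match < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("probabilities must lie strictly between 0 and 1")
    return math.ceil(math.log(target) / math.log(per_probe_match))


# Ten probes, each matching by chance half the time
print(random_match_probability([0.5] * 10))
# Probes of that kind needed for a one-in-a-million chance match
print(probes_needed(0.5, 1e-6))
```

Probes selected for high heterogeneity in the population correspond to small per-probe values, which is why fewer of them are needed for the same overall certainty.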

Genetic Screening

1. test alleles with markers 

The present invention provides for the ability to screen for genetic variations of individuals. For example, a number
of genetic diseases are linked with specific alleles. See, e.g., Scriver, C. et al. (eds.) (1989) The Metabolic Basis of
Inherited Disease, McGraw-Hill, New York. In one embodiment, cystic fibrosis has been correlated with a specific gene;
see Gregory et al. (1990) Nature 347:382-386. A number of alleles are correlated with specific genetic deficiencies.
See, e.g., McKusick, V. (1990) Mendelian Inheritance in Man: Catalogs of Autosomal Dominant, Autosomal Recessive,
and X-linked Phenotypes, Johns Hopkins University Press, Baltimore; Ott, J. (1985) Analysis of Human Genetic
Linkage, Johns Hopkins University Press, Baltimore; Track, R. et al. (1989) Banbury Report 32: DNA Technology and
Forensic Science, Cold Spring Harbor Press, New York; each of which is hereby incorporated herein by reference.

2. Amniocentesis

Typically, amniocentesis is used to determine whether chromosome translocations have occurred. The mapping 
procedure may provide the means for determining whether these translocations have occurred, and for detecting par- 
ticular alleles of various markers. 

MAPPING

Positionally Located Clones 

The present invention allows for the positional location of specific clones useful for mapping. For example, caged
biotin may be used for specifically positioning a probe to a location on a matrix pattern.

In addition, the specific probes may be positionally directed to specific locations on a substrate by targeting. For
example, polypeptide specific recognition reagents may be attached to oligonucleotide sequences which can be
complementarily targeted, by hybridization, to specific locations on a VLSIPS substrate. Hybridization conditions, as
applied for oligonucleotide probes, will be used to target the reagents to locations on a substrate having complementary
oligonucleotides synthesized thereon. In another embodiment, oligonucleotide probes may be attached to specific
polypeptide targeting reagents such as an antigen or antibody. These reagents can be directed towards a complementary
antigen or antibody already attached to a VLSIPS substrate.

In another embodiment, an unnatural nucleotide which does not interfere with natural nucleotide complementary
hybridization may be used to target oligonucleotides to particular positions on a substrate. Unnatural optical isomers of
natural nucleotides should be ideal candidates.

In this way, short probes may be used to determine the mapping of long targets or long targets may be used to map
the position of shorter probes. See, e.g., Craig et al. (1990) Nuc. Acids Res. 18:2653-2660.
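
The use of short probes to map long targets can be sketched as follows. The illustration assumes that each long clone has been scored against a panel of short probes, and that two clones sharing several positive probes are likely to overlap on the map. This is a simplified reading for illustration, not the procedure of the cited reference; the clone and probe names and the `min_shared` cutoff are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_clones(fingerprints: Dict[str, Set[str]],
                       min_shared: int = 2) -> List[Tuple[str, str, int]]:
    """List clone pairs that share at least min_shared hybridizing probes."""
    pairs = []
    for a, b in combinations(sorted(fingerprints), 2):
        shared = fingerprints[a] & fingerprints[b]
        if len(shared) >= min_shared:
            pairs.append((a, b, len(shared)))
    return pairs


# Invented example: clone -> probes that hybridize to it
example_fingerprints = {
    "cloneA": {"p1", "p2", "p3"},
    "cloneB": {"p2", "p3", "p4"},
    "cloneC": {"p4", "p5", "p6"},
}
print(overlapping_clones(example_fingerprints))
```

Lowering `min_shared` to 1 would also link cloneB to cloneC through the single shared probe, at the cost of more chance matches.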

Positionally Defined Clones

Positionally defined clones may be transferred to a new substrate by either physical transfer or by synthetic means.
Synthetic means may involve either a production of the probe on the substrate using the VLSIPS synthetic methods, or
may involve the attachment of a targeting sequence made by VLSIPS synthetic methods which will target that
positionally defined clone to a position on a new substrate. Both methods will provide a substrate having a number of
positionally defined probes useful in mapping.

CONCLUSION 

The present inventions provide greatly improved methods and apparatus for synthesis of polymers on substrates.
It is to be understood that the above description is intended to be illustrative and not restrictive. Many embodiments will
be apparent to those of skill in the art upon reviewing the above description. By way of example, the invention has been
described primarily with reference to the use of photoremovable protective groups, but it will be readily recognized by
those of skill in the art that sources of radiation other than light could also be used. For example, in some embodiments
it may be desirable to use protective groups which are sensitive to electron beam irradiation or x-ray irradiation, in
combination with electron beam lithography or x-ray lithography techniques. Alternatively, the group could be removed by
exposure to an electric current. The scope of the invention should, therefore, be determined not with reference to the
above description, but should instead be determined with reference to the appended claims, along with the full scope
of equivalents to which such claims are entitled.

All publications and patent applications referred to herein are incorporated by reference to the same extent as if
each individual publication or patent application was specifically and individually incorporated by reference. The present
invention now being fully described, it will be apparent to one of ordinary skill in the art that many changes and
modifications can be made thereto without departing from the spirit or scope of the appended claims.

Claims

1. A method for distinguishing two nucleic acid samples comprising:

(a) providing a substrate having at least 100 different probes in known locations,

(b) hybridizing a first nucleic acid sample to the substrate to obtain a first hybridization pattern,

(c) hybridizing a second nucleic acid sample to the substrate to obtain a second hybridization pattern, and

(d) comparing the first hybridization pattern with the second hybridization pattern, thereby distinguishing the
first nucleic acid sample from the second nucleic acid sample.

2. The method of claim 1, wherein at least part of the sequence of each of the probes is known.

3. The method of claim 1, wherein at least some of the probes are oligonucleotides.

4. The method of claim 1, wherein the difference is detected by comparing a first partial nucleotide sequence generated
from the first hybridization pattern with a second partial nucleotide sequence generated from the second
hybridization pattern.

5. A method for determining whether a test sample of nucleic acid is from an individual comprising:

(a) providing a reference sample of nucleic acid from the individual;

(b) providing a substrate having a plurality of different nucleic acids in known locations;

(c) applying the reference sample to the substrate to obtain a reference hybridization pattern;

(d) applying the test sample to the substrate to obtain a test hybridization pattern; and

(e) comparing the reference hybridization pattern with the test hybridization pattern to determine whether the
test sample is from the individual.

6. A method for identifying a nucleic acid sample comprising:

(a) providing a substrate having a plurality of different probes in known locations;

(b) applying the nucleic acid sample to the substrate to obtain a hybridization pattern; and 

(c) comparing the hybridization pattern with a reference database to identify the nucleic acid sample. 

7. The method of claim 6, wherein at least part of the sequence of each of the probes is known.

8. The method of claim 6, wherein at least some of the probes are oligonucleotides.

9. The method of claim 6, wherein the hybridization pattern is analyzed to generate a partial nucleotide sequence for
the nucleic acid sample, and the partial nucleotide sequence is compared with a reference nucleotide sequence in
the reference database.

10. A method for identifying a sample of nucleic acid comprising: 

(a) providing the sample;

(b) providing a substrate having a plurality of different probes in known locations;

(c) applying the sample to the substrate to obtain a hybridization pattern; and 

(d) comparing the hybridization pattern to a reference database of a plurality of sources of nucleic acid, thereby 
identifying the source of the sample. 

11. The method of claim 10, wherein at least part of the sequence of each of the probes is known.

12. The method of claim 10, wherein at least some of the probes are oligonucleotides.

13. The method of claim 10, wherein the hybridization pattern is analyzed to generate a partial nucleotide sequence,
and the partial nucleotide sequence is compared with a reference nucleotide sequence in the reference database.

14. The method of claim 10, wherein the sample is from an individual, and the plurality of sources of nucleic acid are 
provided by potential relatives of the individual thereby identifying a genealogy of the individual. 

15. The method of claim 10, wherein the sample is provided by an organism, and the sources of nucleic acid are pro- 
vided by a plurality of known individuals, thereby identifying the organism as one of the known individuals. 

16. A method of marking a plurality of samples comprising: 

(a) providing a plurality of markers, each marker comprised of a different polymer encoded by a unique 
sequence; and 

(b) including a different marker in each sample.

17. The method of claim 16, wherein each marker consists essentially of between 10 and 1000 molecules of polymer.

18. The method of claim 16, wherein the polymers are nucleic acids.

19. The method of claim 18 further comprising:

(c) identifying the sample by nucleotide sequence determination.

20. The method of claim 19, wherein the nucleic acids are amplified prior to nucleotide sequence determination.

21. The method of claim 16, wherein the marker does not have biological activity.

22. A method of decoding a nucleic acid marker comprising:

(a) providing a substrate having a plurality of different probes in known locations; 

(b) applying the nucleic acid marker to the substrate under hybridization conditions; and 

(c) detecting complementarity of the nucleic acid marker to the plurality of different probes, thereby identifying 
the nucleic acid marker as comprising a nucleotide sequence of the complementary probe. 

23. A method for detecting genetic variation among different individuals comprising: 

(a) providing a sample from each of a plurality of individuals, wherein the sample contains nucleic acid of the 
individual; 

(b) providing a substrate having a plurality of probes in known locations;

(c) applying each sample to the substrate to obtain a hybridization pattern; and 

(d) comparing the hybridization patterns obtained to detect genetic variation. 

24. A method of detecting genetic variation related to a disease comprising: 

(a) providing samples of nucleic acid from each of a plurality of individuals, wherein at least some individuals 
have a high genetic propensity for the disease and at least some other individuals have low genetic propensity 
for the disease; 

(b) providing a substrate having a plurality of probes in known locations; 

(c) applying each sample to the substrate to obtain a hybridization pattern; and

(d) correlating the hybridization patterns obtained to high genetic propensity for the disease. 

25. A method for detecting a mutation in an individual comprising: 

(a) providing a sample of nucleic acid from the individual;

(b) providing a substrate having a plurality of oligonucleotides in known locations, wherein at least some oligo- 
nucleotides have mutant sequence and at least some other oligonucleotides have wild type sequence; 

(c) applying each sample to the substrate to obtain a hybridization pattern; and 

(d) detecting mutation in the individual by analysis of the hybridization pattern.

26. A method for diagnosing an abnormal tissue comprising: 

(a) providing a sample from the abnormal tissue, wherein the sample contains transcripts of the abnormal tis- 
sue; 

(b) providing a substrate having a plurality of different probes in known locations;

(c) applying the sample to the substrate to obtain an expression pattern; and 

(d) comparing the expression pattern to a reference database having expression patterns for a plurality of
known abnormalities of tissue, thereby diagnosing the abnormal tissue.

27. A method for determining a cellular composition of a tissue comprising:

(a) providing a sample from the tissue, wherein the sample contains nucleic acid of the tissue; 

(b) providing a substrate having a plurality of probes in known locations; 

(c) applying the sample to the substrate to obtain an expression pattern; and 

(d) comparing the expression pattern to a reference database having expression patterns for a plurality of
known cells, thereby determining the cellular composition of the tissue.

28. A method for identifying a microbe comprising:

(a) providing a sample from the microbe, wherein the sample contains nucleic acid of the microbe;

(b) providing a substrate having a plurality of probes in known locations; 

(c) applying the sample to the substrate to obtain a hybridization pattern; and

(d) comparing the hybridization pattern with a reference database to identify the microbe. 

29. A method for detecting genetic variation among different isolates of a microbe comprising: 

(a) providing a sample from each of a plurality of isolates, wherein the sample contains nucleic acid of the iso- 
late; 

(b) providing a substrate having a plurality of probes in known locations; 

(c) applying each sample to the substrate to obtain a hybridization pattern; and

(d) comparing the hybridization patterns obtained to detect genetic variation. 

30. The method of claim 29, wherein the microbe is selected from the group consisting of protozoa, virus and bacteria. 

31. A method of information storage and recovery comprising:

(a) encoding the information as a polymer sequence; 

(b) storing the information by synthesis of the polymer sequence; and 

(c) recovering the information by replication of the polymer. 

32. The method of claim 31 further comprising: 

(d) determining the polymer sequence to provide decoded information. 

33. The method of claim 31, wherein the polymer is a nucleic acid.

34. The method of claim 31, wherein the polymer is an oligonucleotide.

35. The method of claim 31, wherein the polymer is synthesized on a substrate.

36. The method of claim 31, wherein the data is stored on a substrate at a density of at least 10³ different polymers per
cm² of the substrate.

37. The method of claim 31, wherein the data is stored on a substrate at a density of at least 10⁴ different polymers per
cm² of the substrate.

38. The method of claim 31, wherein the data is stored on a substrate at a density of at least 10⁵ different polymers per
cm² of the substrate.

39. The method of claim 31, wherein the data is stored on a substrate at a density of at least 10⁶ different polymers per
cm² of the substrate.

40. A method for preparing a reagent matrix comprising: 

(a) providing a plurality of first targeting elements at specific locations on a substrate, wherein at least 10³
different first targeting elements per cm² are located on the substrate;

(b) providing a second targeting element capable of specific binding with at least one of the first targeting ele- 
ments, wherein the second targeting element is linked to a reagent; and 

(c) contacting the second targeting element to the substrate under conditions allowing specific binding, and 
thereby preparing the reagent matrix. 

41. The method of claim 40, wherein the specific binding is selected from the group consisting of nucleic acid
hybridization, antigen-antibody binding, and biotin-avidin binding.

42. The method of claim 40, wherein the reagent is a polymer. 

43. The method of claim 40, wherein the reagent is a polypeptide.

44. The method of claim 43, wherein the polypeptide is a drug receptor.

45. The method of claim 43, wherein the polypeptide is an antibody or an antigen.

46. The method of claim 40, wherein the reagent is a nucleic acid vector.

47. The method of claim 40, wherein the reagent is a nucleic acid. 

48. The method of claim 40, wherein the reagent is an oligonucleotide.

49. The method of claim 40, wherein at least one first targeting element is synthesized on the substrate.

50. The method of claim 40, wherein at least 10⁴ first targeting elements per cm² are located on the substrate.

51. The method of claim 50, wherein at least 10⁵ first targeting elements per cm² are located on the substrate.

52. The method of claim 50, wherein at least 10⁶ first targeting elements per cm² are located on the substrate.

53. The method of claim 40, wherein at least 10⁴ different first targeting elements are located on the substrate.

54. The method of claim 53, wherein at least 10⁵ different first targeting elements are located on the substrate.

55. The method of claim 54, wherein at least 10⁶ different first targeting elements are located on the substrate.

56. The method of claim 40, wherein the substrate is contacted by direct transfer of a colony or plaque containing the
second targeting element.

57. The method of claim 40, wherein the first and second targeting elements are nucleic acids that do not specifically
bind with natural nucleic acids. 
